**Texture and Colour in Image Analysis**

## Edited by

Francesco Bianconi, Antonio Fernández and Raúl E. Sánchez-Yáñez

Printed Edition of the Special Issue Published in *Applied Sciences*

www.mdpi.com/journal/applsci

## **Texture and Colour in Image Analysis**

Editors

**Francesco Bianconi, Antonio Fernández, Raúl E. Sánchez-Yáñez**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors*

Francesco Bianconi, Department of Engineering, Università degli Studi di Perugia, Perugia, Italy

Antonio Fernández, School of Industrial Engineering, Universidade de Vigo, Vigo, Spain

Raúl E. Sánchez-Yáñez, Department of Electronic Engineering, Universidad de Guanajuato, Salamanca, Mexico

*Editorial Office* MDPI, St. Alban-Anlage 66, 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Applied Sciences* (ISSN 2076-3417) (available at: www.mdpi.com/journal/applsci/special issues/ texture colour image analysis).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-1378-2 (Hbk) ISBN 978-3-0365-1377-5 (PDF)**

© 2021 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


Reprinted from: *Applied Sciences* **2019**, *9*, 2418, doi:10.3390/app9122418

## **Yang Liu, Ke Xu and Jinwu Xu**


## **About the Editors**

#### **Francesco Bianconi**

Francesco Bianconi received his MEng from the University of Perugia, Italy, and his PhD in computer-aided design from a consortium of Italian universities. He has been a visiting researcher with the University of Vigo, Spain; the University of East Anglia, U.K.; Queen Mary University of London, U.K.; and City, University of London, U.K. He is currently an associate professor with the Department of Engineering, University of Perugia, where he conducts research on computer vision, image processing, and pattern recognition, with special focus on texture and colour analysis for industrial and biomedical applications. Prof. Bianconi is an IEEE senior member, chartered engineer, and court-appointed expert; has been a TPC/IPC member of more than 30 international conferences and symposia; and is currently a member of the editorial board of two scholarly journals.

#### **Antonio Fernández**

Antonio Fernández graduated as an industrial engineer (equivalent to an MEng degree) in 1993 and obtained his PhD in 1998 from the University of Vigo, Spain, with a thesis entitled "Development of pulsed TV-holography techniques for the analysis of transient wave propagation in mechanical parts". He joined the Applied Physics Department at the University of Vigo in 1993 and then the Department of Engineering Design as an associate lecturer. He is now a senior lecturer in the same department. Prof. Fernández teaches Engineering Graphics, Computer Programming using Python, and Image Processing. His research interests include computer vision, image processing, and pattern recognition, with a special focus on texture analysis.

#### **Raúl E. Sánchez-Yáñez**

Raúl E. Sánchez-Yáñez is a doctor of science (optics), concluding his studies at the Centro de Investigaciones en Óptica (Optical Research Center, CIO) in León, Mexico, in 2002. He is also a master of electrical engineering and has a BEng in electronics, with both degrees received from the University of Guanajuato at Salamanca (Mexico), where he has been a full-time professor since 2003. His research interests include colour and texture analysis for computer vision tasks, and computational intelligence applications in feature extraction and decision making.

## **Preface to "Texture and Colour in Image Analysis"**

Texture and colour are optical stimuli that determine, to a great extent, the visual perception of objects, materials and scenes. It is no surprise, then, that texture and colour have received a great deal of research attention for at least forty years. The aptitude to process these stimuli in an effective way indeed plays a major role in the interaction between intelligent beings and the environment in which they live. Consequently, the ability to reproduce this behaviour within intelligent machines is fundamental in a wide range of applications: product inspection, object recognition, materials classification, computer-assisted medical image analysis, content-based image retrieval and remote sensing are just some examples.

In recent times, research in this topic has experienced significant changes. While the hand-crafted approach was the leading strategy up until not long ago, nowadays, Deep Learning has become the major focus. This book collects 16 technical contributions to the field (plus one editorial) from highly reputable researchers from around the world. The papers are grouped by subject and presented in the following order: Theory (1–4), Applications (5–14), Benchmarks and Comparative Evaluations (15), and Reviews (16).

> **Francesco Bianconi, Antonio Fernández, Raúl E. Sánchez-Yáñez** *Editors*

## *Editorial* **Special Issue Texture and Color in Image Analysis**

**Francesco Bianconi <sup>1,\*</sup>, Antonio Fernández <sup>2</sup> and Raúl E. Sánchez-Yáñez <sup>3</sup>**


## **1. Introduction and Background**

Texture and color are two types of visual stimuli that determine, to a great extent, the appearance of objects, materials, and scenes. The ability to process these stimuli enables humans and animals to interact with the environment they live in. As a consequence, texture and color have attracted a lot of research interest since early on. Color and texture analyses are also central to a wide range of applications including materials classification, surface inspection and grading, object recognition, biometric identification, content-based multimedia retrieval, remote sensing, and medical image analysis. In recent years, the appearance of new methodologies (notably deep learning) has elicited renewed interest toward the field. In this context, the objective of this Special Issue is to provide a forum for scientists and practitioners to discuss strategies, challenges, and perspectives in the discipline. The response of the community was substantial, which again confirms the interest in the topics; altogether, we received 26 contributions, of which 16 were deemed suitable for publication after peer review.

**Citation:** Bianconi, F.; Fernández, A.; Sánchez-Yáñez, R.E. Special Issue Texture and Color in Image Analysis. *Appl. Sci.* **2021**, *11*, 3801. https://doi.org/10.3390/app11093801

Received: 1 April 2021 Accepted: 12 April 2021 Published: 22 April 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **2. Theory**

Four papers treated theoretical aspects of image processing, and two of them [1,2] focused on color texture analysis. Navarro and Perez [1] introduced a method for pattern classification through color and texture features based on image partition. Their approach computes global and local features from different areas of the input image. Each pixel is represented as a quaternion and the color features are collected in a histogram obtained using Binary Quaternion Moment Preserving (BQMP). Textural information is extracted via Haralick's features from each partition to form a feature vector, and a joint color–texture representation is obtained by merging the color code and a normalized texture descriptor. Smeraldi et al. [2] proposed a novel framework for color texture analysis based on partial orderings. Partial orders (PO) make it possible to compare multivariate data that, like colors, lack a natural order. In their work, the authors defined a general approach to extract rank features in color spaces via PO. They also extended a classical descriptor (the Texture Spectrum) to work with partial orders and showed that the partial-order version in color space outperformed the original grayscale descriptor.

Zhang et al. [3] presented a method for the identification of tampered images. In their solution, the input image is first filtered with Local Tchebichef Moments (LTM); then, the result is subtracted from the original image to obtain the 'residuals'. An error-correcting output code based on ensemble learning eventually classifies the images as tampered or not tampered.
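The filter-then-subtract step can be illustrated with a minimal sketch. It is not the method of Zhang et al.: a plain 3 × 3 mean filter stands in for the Local Tchebichef Moments filter, and the function names `mean_filter3` and `residuals` are hypothetical.

```python
def mean_filter3(image):
    """Smooth a grayscale image (list of rows) with a 3x3 mean filter.

    Assumption: this simple filter is only a stand-in for the LTM filter.
    Border pixels average over the neighbours that exist.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            vals = [image[ny][nx]
                    for ny in range(max(0, y - 1), min(rows, y + 2))
                    for nx in range(max(0, x - 1), min(cols, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def residuals(image):
    """Return the original image minus its filtered version."""
    smooth = mean_filter3(image)
    return [[image[y][x] - smooth[y][x] for x in range(len(image[0]))]
            for y in range(len(image))]
```

On a perfectly flat image the residuals are all zero, whereas an isolated spike leaves a large residual at its location; residual maps of this kind are what a downstream classifier would consume.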

In [4] Tang et al. addressed the problem of training Convolutional Neural Networks (CNN) and introduced a novel bounded scheduling procedure called Bsadam. The method first searches the upper and lower bound for Adam, then splits the training process into three steps: (1) the minimization step, (2) the convergence step, and (3) the uniform scaling step. The proposed solution was effectively tested with simple neural networks, deep convolution networks, and recurrent networks for image classification and language modeling tasks.

#### **3. Applications**

Biomedical image analysis received much attention in this Special Issue with a total of four papers accepted. In [5], González-Patiño et al. addressed segmentation of mammograms as an optimization problem and considered three metaheuristic approaches: simulated annealing, genetic algorithms, and a bat algorithm. They used the Dunn index as the fitness function to evaluate segmented regions, which were characterized by clinical data, intensity, texture, and shape descriptors. Then, for the diagnosis of breast cancer lesions, they proposed a new artificial immune system (AIS). The performance of the metaheuristic algorithms was compared to intensity-based segmentations obtained using the Otsu method, and the outcomes of the AIS were evaluated on six datasets. Bhattacharjee et al. [6] investigated automated grading of prostate cancer from histology images. Their method is based on four steps: (1) segmentation of the input images via *k*-means; (2) separation of the touching cells through watershed transform; (3) extraction of morphological features; (4) SVM-based classification into four Gleason grade groups: grade 3, grade 4, grade 5, and benign.

Bontozoglou and Xiao [7] explored assessing a person's condition from capacitive images of their hair and skin. Concretely, they attempted to determine whether a capacitive imaging sensor in combination with image processing algorithms such as gradient-based segmentation, gray level co-occurrence matrix, and normalized cross-correlation could be used in different hair and skin analysis tasks that are of great interest to the cosmetic and pharmaceutical industries, namely, the detection of skin polygons, the estimation of the length of the bounding wrinkles, and the observation of hair water sorption capabilities. The experimental results indicate that the proposed approach can be successful for detecting and tracking skin artifacts (e.g., wrinkles, moles, or scars) as well as skin age classification. Evidence indicates that capacitive imaging can also be applied to hair water loss studies.

In [8], Obuchowicz et al. examined whether additional digital intraoral radiography (DIR) image preprocessing based on texture analysis improves the recognition and differentiation of periapical lesions. They applied several texture models such as co-occurrences, first-order features, run-length matrices, gray-tone difference matrices, and local binary patterns to transform DIR images into feature maps. To improve the recognition of osteolytic and sclerotic lesions, the feature maps were further processed through *k*-means clustering. The ability of the proposed approach to yield information about the shape of a structure, its pattern, and adequate contrast was validated by two radiologists independently. The experimental results showed that the application of feature mapping to radiographic dental images constitutes a promising tool for the refinement and possible differentiation of periapical lesions.

Three papers investigated industrial applications. Furferi et al. [9] presented a computer vision system for counting small metal parts produced by electrodeposition. This manufacturing procedure is common in the fashion field and, since the raw materials are usually gold and silver, it is of paramount importance to reduce the amount of waste. The devised method employs a combination of image thresholding and morphological operations. Liu et al. [10] investigated online defect detection in the production of steel plates. This is a fairly common problem in the industry, and requires both speed and high recognition accuracy. The proposed solution relies on Multiblock Local Binary Patterns (MB-LBP), which the authors found to be superior to other methods such as the Gray-Level Co-occurrence Matrix (GLCM), the Scale-Invariant Feature Transform (SIFT), and Speeded-Up Robust Features (SURF). Geng et al.'s work [11] is concerned with the problem of measuring the period length and the skew angle patterns of textile cutting pieces. This kind of semifinished product has been widely used in car seats and garment production. Experimenting on a dataset of 5000 images, the authors demonstrated the suitability of a regional convolutional neural network (R-CNN) for the task.
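As a rough illustration of how thresholding can lead to a part count, the sketch below binarizes a grayscale image and counts 4-connected bright regions. It is a simplified stand-in, not the system of Furferi et al.: the morphological operations they use are omitted, and the function name `count_parts` and the fixed threshold are assumptions made for the example.

```python
from collections import deque


def count_parts(gray, threshold):
    """Count 4-connected regions whose pixels exceed `threshold`.

    `gray` is a list of rows of intensities. Assumption: parts are brighter
    than the background and do not touch each other.
    """
    rows, cols = len(gray), len(gray[0])
    mask = [[v > threshold for v in row] for row in gray]
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for y in range(rows):
        for x in range(cols):
            if mask[y][x] and not seen[y][x]:
                count += 1                      # new region found
                seen[y][x] = True
                queue = deque([(y, x)])
                while queue:                    # breadth-first flood fill
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count
```

In practice, touching parts would merge into one region, which is one reason a morphological step such as erosion is typically applied before counting.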

Two papers addressed remote sensing problems. Wang et al. [12] described a technique to accurately identify maceral components in the fields of mining and geology. The correct identification of such components is central to a number of industrial processes such as hydrogenation, combustion, carbonization, and gasification. The proposed method employs a two-level coarse-to-fine clustering procedure to divide microscopic images into a sequence of regions with similar attributes (i.e., binder, vitrinite, liptinite, and inertinite). Yu et al. [13] addressed the problem of image segmentation of river scenes. To this end, they proposed a novel approach based on a reflection mechanism of the water surface. Their method employs a Multiblock Local Binary Patterns texture and hue variance in the HSI color space to detect the shadow area of the water's surface. A morphological operation with multiple dilation was employed to reduce false positives due to pseudo-water-patches.

The work by Nanni et al. [14] considered quite an original case study, that is, the automated classification of animal audio. For this task, the authors proposed the use of a combination of Siamese neural networks and different clustering techniques to train a support vector classifier.

#### **4. Benchmarks and Comparative Evaluations**

Using handcrafted features as visual descriptors has been the dominant paradigm in computer vision for many years. In the last decade, however, as a consequence of the extraordinary advances in the field of deep learning, focus has been shifting from the model-based ('a priori') approach to 'a posteriori' strategies, where the features are learned from the data. Both methods have pros and cons; which one should be used in any specific application, however, is far from clear. In this context, Karabağ et al. [15] comparatively evaluated traditional and deep learning methods for texture segmentation. In their work, they considered five well-known hand-designed methods (co-occurrence, filtering, local binary patterns, watershed, and multiresolution sub-band filtering) and a deep learning approach based on the U-Net architecture. The methods were evaluated on six classic mosaics of textured images. The main conclusion is that U-Net is effective for texture segmentation and provides results equal to or better than those achieved with traditional texture algorithms. The authors also concluded that determining the correct configuration of the network is not a trivial task, and that variations of some parameters can easily lead to suboptimal results.

#### **5. Reviews**

Buzzelli [16] presented a valuable review of different approaches for automatic estimation of visual saliency—i.e., the perceptual property that makes specific elements in a scene stand out and attract the attention of the viewer. The work mainly investigates those domains where research attention is currently high, such as omnidirectional images, image groups for cosaliency, and video sequences. The paper also introduces domain-specific evaluation measures and provides quantitative comparisons among the different methods.

**Author Contributions:** All the authors have contributed equally. All authors have read and agreed to the published version of the manuscript.

**Funding:** Partially supported by the Department of Engineering, Università degli Studi di Perugia, Italy, within the project *Artificial intelligence for Earth observation* (Fundamental Research Grant Scheme 2020).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** This Special Issue would not have been possible without the valuable contribution of the authors, reviewers, copy editors and other members of the editorial team. We particularly wish to thank Tamia Qing, Section Managing Editor, for her continuous support throughout all the phases of the process.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Color–Texture Pattern Classification Using Global–Local Feature Extraction, an SVM Classifier, with Bagging Ensemble Post-Processing**

## **Carlos F. Navarro and Claudio A. Perez \***

Image Processing Laboratory, Electrical Engineering Department and Advanced Mining Technology Center, Universidad de Chile, Santiago 8370451, Chile

**\*** Correspondence: clperez@ing.uchile.cl; Tel.: +0-562-2978-4207; Fax: +0-562-2672-0162

Received: 12 June 2019; Accepted: 30 July 2019; Published: 1 August 2019

**Featured Application: The proposed method is a new tool to characterize colored textures and may be applied in various applications such as content image retrieval, characterization of rock samples, biometrics, classification of fabrics, and in non-destructive inspection in wood, steel, ceramic, fruit, and aircraft surfaces.**

**Abstract:** Many applications in image analysis require the accurate classification of complex patterns including both color and texture, e.g., in content image retrieval, biometrics, and the inspection of fabrics, wood, steel, ceramics, and fruits, among others. A new method for pattern classification using both color and texture information is proposed in this paper. The proposed method includes the following steps: division of each image into global and local samples, texture and color feature extraction from samples using Haralick statistics and the binary quaternion-moment-preserving method, a classification stage using a support vector machine, and a final stage of post-processing employing a bagging ensemble. One of the main contributions of this method is the image partition, allowing image representation into global and local features. This partition captures most of the information present in the image for colored texture classification, allowing improved results. The proposed method was tested on four databases extensively used in color–texture classification: the Brodatz, VisTex, Outex, and KTH-TIPS2b databases, yielding correct classification rates of 97.63%, 97.13%, 90.78%, and 92.90%, respectively. The use of the post-processing stage improved those results to 99.88%, 100%, 98.97%, and 95.75%, respectively. We compared our results to the best previously published results on the same databases, finding significant improvements in all cases.

**Keywords:** colored texture pattern classification; global–local texture classification; color–texture features; color–texture feature extraction; bagging post-processing; BQMP and Haralick global–local feature integration

## **1. Introduction**

Texture pattern classification has been considered an important problem in computer vision for many years because of the great variety of possible applications, including non-destructive inspection of abnormalities on wood, steel, ceramics, fruit, and aircraft surfaces [1–6]. Texture discrimination remains a challenge since the texture of objects varies significantly according to the viewing angle, illumination conditions, scale change, and rotation [1,4,7,8]. There is also the special problem of color image retrieval related to appearance-based object recognition, which is a major field of development for several industrial vision applications [1,4,7,8].

Feature extraction of color, texture, and shape from images has been used successfully to classify patterns by reducing the dimensionality and the computational complexity of the problem [3,9–16]. Determining the appropriate features for each problem is a recurring challenge which is yet to be fully met by the computer vision community [1,3,16]. Feature extraction and selection enable representation of the information present in the image, and limit the number of features, thus allowing further analysis within a reasonable time. Feature extraction has been used in a wide range of applications, such as biometrics [12,14,15], classification of cloth, surfaces, landscapes, wood, and rock minerals [16,17], saliency detection [18], and background subtraction [19], among others. During the past 40 years, while a substantial number of methods for grayscale texture classification were developed [3,5], there has also been a growing interest in colored textures [1,2,9,10,13,20,21]. The adaptive integration of color and texture attributes into the development of complex image descriptors is an area of intensive research in computer vision [21]. Most of these investigations focused on the integration process in applications for digital image segmentation [20,22] or the aggregation of multiple preexisting image descriptors [23]. Deep learning has been applied successfully to object or scene recognition [24] and scene classification [25], and the use of deep neural networks for the classification of image datasets where texture features are important for generating class-conditional discriminative representations has been investigated [26].

Current approaches to color texture analysis can be classified into three groups: the parallel approach, the sequential approach, and the integrative approach [11,27]. The parallel approach considers texture and color as separate phenomena. Color analysis is based on the color distribution in an image, without regard to the spatial relationship between the intensities of the pixels. Texture analysis is based on the relative variation of the intensity of the neighbors, regardless of the color of the pixels. In Reference [2], the authors first converted the original RGB images into other color spaces: HSI (Hue, Saturation, Intensity), CIE XYZ (Commission Internationale de l'Éclairage tristimulus values), YIQ (Luminance, In-phase, Quadrature), and CIELAB (Commission Internationale de l'Éclairage lightness, green–red, and blue–yellow components), and then extracted the texture features and color separately. In a similar manner, as reported in Reference [13], the images were transformed to the color spaces HSV and YCbCr, obtaining wavelet intensity channel features of first-order statistics on each channel. The choice of the best performing color space has been an open question in recent years, since using one space instead of another can bring considerable improvements in certain applications [28]. The quaternion representation of color was shown to be effective in segmentation in Reference [29], and binary quaternion moment preserving (BQMP), a method of image binarization using quaternions, has the potential to be a powerful tool in color image analysis [6].
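The separation that the parallel approach relies on can be sketched as follows. This is an illustrative example only, not code from any of the cited works: it uses Python's standard `colorsys` module to convert each RGB pixel to HSV and YIQ, builds a hue histogram (color distribution, no spatial information), and keeps the luminance plane for a subsequent texture analysis. The function name `parallel_channels` and the number of hue bins are assumptions.

```python
import colorsys


def parallel_channels(image, hue_bins=6):
    """Split an RGB image into a hue histogram and a luminance plane.

    `image` is a list of rows, each row a list of (r, g, b) tuples in 0..255.
    The hue histogram ignores pixel positions; the luminance plane keeps
    them and discards color.
    """
    hue_hist = [0] * hue_bins
    luminance = []
    for row in image:
        lum_row = []
        for r, g, b in row:
            rf, gf, bf = r / 255.0, g / 255.0, b / 255.0
            h, _s, _v = colorsys.rgb_to_hsv(rf, gf, bf)   # hue in [0, 1)
            y, _i, _q = colorsys.rgb_to_yiq(rf, gf, bf)   # y is the luminance
            hue_hist[min(int(h * hue_bins), hue_bins - 1)] += 1
            lum_row.append(y)
        luminance.append(lum_row)
    return hue_hist, luminance
```

The two outputs would then feed independent color and texture descriptors whose results are combined only at the classification stage.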

In the sequential approach to color texture analysis, the first step is to apply a method of indexing color images. The indexed images are then processed as grayscale textures. The co-occurrence matrix has been used extensively since it represents the probability of occurrence of the same color pixel pair at a given distance [9]. Another example of this approach is based on texture descriptors using three different color indexing methods and three different texture features [11]. This results in nine independent classifiers that are combined through various schemes.
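A co-occurrence matrix of an indexed image can be computed with a few lines of code. The sketch below is a generic illustration rather than the implementation of any cited work; the function name `cooccurrence` and the default displacement of one pixel to the right are assumptions.

```python
def cooccurrence(indexed, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for the displacement (dx, dy).

    `indexed` is a list of rows of integer color indices in 0..levels-1.
    Entry [a][b] is the relative frequency with which index a is followed,
    at the given displacement, by index b.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(indexed), len(indexed[0])
    total = 0
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < rows and 0 <= x2 < cols:
                counts[indexed[y][x]][indexed[y2][x2]] += 1
                total += 1
    if total == 0:
        raise ValueError("displacement leaves no valid pixel pairs")
    return [[c / total for c in row] for row in counts]
```

For the 2 × 3 indexed image `[[0, 0, 1], [0, 1, 1]]` with two levels, the four horizontal pairs are (0,0), (0,1), (0,1), and (1,1), so the matrix is `[[0.25, 0.5], [0.0, 0.25]]`.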

Integration models are based on the interdependence of texture and color. These models can be divided into single-band models, if the data of each channel is considered separately, and multiband models, if two or more channels are considered together. The advantage of the single-band approach is the easy adaptation of classical models based on a grayscale domain, such as Gabor filters [15,30–32], local binary patterns (LBP) or variants [5,7,8,33–38], Galois fields [39], or Haralick statistics [3]. In Reference [2], the main objective was to determine the contribution of color information for the overall performance of classification using Gabor filters and co-occurrence measures, yielding results almost 10% better than those obtained with only grayscale images. In Reference [4], the results reported using co-occurrence matrices reached 94.41% and 99.07% on the Outex and Vistex databases, respectively. Different classifiers, such as k-nearest neighbors, neural networks, and support vector machines (SVM) [40], are used to combine features. The latter was proven to be more efficient when feature selection is performed [41,42] or clustering is used [22].
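To make the single-band idea concrete, the following sketch applies the basic 3 × 3 local binary pattern operator to each color channel separately and concatenates the three histograms. It is a textbook form of LBP, not one of the variants cited above, and the function names are hypothetical.

```python
def lbp_code(channel, y, x):
    """Basic 8-neighbor LBP code of the pixel at (y, x) in one channel."""
    center = channel[y][x]
    # Neighbors visited clockwise, starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if channel[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(channel):
    """256-bin histogram of LBP codes over the interior pixels."""
    hist = [0] * 256
    for y in range(1, len(channel) - 1):
        for x in range(1, len(channel[0]) - 1):
            hist[lbp_code(channel, y, x)] += 1
    return hist


def single_band_descriptor(rgb_image):
    """Concatenate the LBP histograms of the R, G, and B channels."""
    descriptor = []
    for c in range(3):
        channel = [[pixel[c] for pixel in row] for row in rgb_image]
        descriptor.extend(lbp_histogram(channel))
    return descriptor
```

A multiband variant would instead compare values across channels, for example thresholding the neighbors in one channel against the center pixel of another.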

There are other methods that reached the best results in color texture analysis on databases that are publicly available. In Reference [1], a multi-scale architecture (Multi-Scale Supervised self-Organizing Orientational-invariant Neural or multi-scale SOON) was presented that reached 91.03% accuracy on the Brodatz database, which contains 111 different colored textures. These results were compared with those of two previous studies on the same database, reported in References [43,44], which reached classification rates of 89.71% and 88.15%, respectively. Another approach used all the information from sensors that created the image [4]. This method improved the results to 98.61% and 99.07%, on the same database, but required a non-trivial change in the architecture of the data collection. A texture descriptor based on a local pattern encoding scheme using local maximum sum and difference histograms was proposed in Reference [8]. The experimental results were tested on the Outex and KTH-TIPS2b databases reaching 95.8% and 91.3%, respectively. In References [32,33,36,39], the methods were not tested on the complete databases. In References [26,45,46], the methods used a different metric to evaluate the classification.

A new method for the classification of complex colored textures is proposed in this paper. The images are divided into global and local samples, from which features are extracted to provide a new representation based on global and local features. Color and texture features are obtained at different spatial scales: the global scale, using the whole image, and the local scale, using quadrants. This representation seems to capture most of the information present in the image, improving colored texture classification. The features are then classified with a support vector machine and, finally, a post-processing stage based on bagging is applied. The proposed method was tested on four different databases: the Colored Brodatz Texture [20], VisTex [4], Outex [47], and KTH-TIPS2b [48] databases. The subdivision of the training partition of the database into sub-images of different sizes, with information extracted from each sub-image, is also new. The results were compared to those of state-of-the-art methods previously published on the same databases.

#### **2. Materials and Methods**

The objective was to create a method for colored texture classification that would improve the classification of different complex color patterns. The BQMP method was used previously in color data compression, color edge detection, and multiclass clustering of color data, but not in classification [6,47]. The BQMP reduces an image to representative colors and creates a histogram that indicates the parts of the image that are represented by these colors. Therefore, the color features of the image are represented in this histogram. Haralick's features [3] are often extracted to characterize textures that measure the grayscale distribution, as well as considering the spatial interactions between pixels [3,9,23,38]. Creating a training set is part of the strategy for obtaining local and global features that contain all the information, local and global, for achieving correct classification. Different classifiers, such as k-nearest neighbors, neural networks, and support vector machines (SVM) [41], are used to combine features. In Reference [44], an SVM showed good performance compared to 16 different classification methods.
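For readers unfamiliar with Haralick statistics, the sketch below computes four commonly used ones from a normalized co-occurrence matrix. Which statistics the proposed method actually uses is not stated at this point, so the selection here (contrast, angular second moment, homogeneity, entropy) and the function name `haralick_features` are assumptions.

```python
import math


def haralick_features(P):
    """Compute four Haralick-style statistics from a normalized matrix P.

    `P` is a square list of lists whose entries sum to 1, where P[i][j] is
    the probability of gray levels i and j co-occurring.
    """
    contrast = asm = homogeneity = entropy = 0.0
    for i, row in enumerate(P):
        for j, p in enumerate(row):
            contrast += (i - j) ** 2 * p             # local intensity variation
            asm += p * p                             # angular second moment
            homogeneity += p / (1.0 + (i - j) ** 2)  # inverse difference moment
            if p > 0:
                entropy -= p * math.log(p)           # randomness of the pairs
    return {"contrast": contrast, "asm": asm,
            "homogeneity": homogeneity, "entropy": entropy}
```

For `P = [[0.25, 0.5], [0.0, 0.25]]`, the contrast is 0.5, the angular second moment is 0.375, and the homogeneity is 0.75.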

#### *2.1. Feature Extraction Quadrants*

The proposed method divides the images to obtain local and global features from them. The method consists of four stages: firstly, the images in the database are divided into images to be used in training and those to be used in testing. In the second stage, color and texture features are extracted from the training images on both global and local scales. In the third stage, color and texture features are fused to become the inputs to the SVM classifier, and the last stage is a post-processing stage that uses bagging with the test images for classification. These stages are summarized in the block diagram of Figure 1.
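The third and fourth stages can be outlined in a short sketch. The aggregation rule is not specified at this point in the text, so a simple majority vote over the labels predicted for the samples of one test image is assumed here; the function names `fuse` and `vote` are hypothetical, and the SVM itself is left out.

```python
from collections import Counter


def fuse(color_features, texture_features):
    """Stage 3 (assumed form): concatenate color and texture features."""
    return list(color_features) + list(texture_features)


def vote(predicted_labels):
    """Stage 4 (assumed form): majority vote over per-sample predictions.

    `predicted_labels` holds the class predicted by the classifier for each
    global or local sample taken from the same test image.
    """
    return Counter(predicted_labels).most_common(1)[0][0]
```

For example, if the samples of a test image are classified as `["wood", "wood", "steel"]`, the vote assigns the image to `"wood"`, so a single misclassified sample does not change the final decision.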

*Appl. Sci.* **2019**, *9*, 4 of 20

**Figure 1.** Block diagram of the proposed method.

To be able to compare the performance of our method with previously published results, we used the same partition, into training and testing sets, in each database. In the case of the Brodatz database, as in Reference [1], each colored texture image was partitioned into nine sub-images, using eight for training, and one for testing. An example of this partition is shown in Figure 2a–c. The Brodatz Colored Texture (CBT) database has 111 images of 640 × 640 pixels. Each image in the database has a different texture.
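The nine-way partition can be sketched as below. The exact tiling is an assumption: since 640 is not divisible by 3, this example uses tiles of 213 × 213 pixels and discards the remaining border pixels, and the choice of which tile is held out for testing is left to the caller. The function names are hypothetical.

```python
def split_grid(image, n=3):
    """Split an image (list of rows) into n*n equal tiles, row by row.

    Assumption: pixels beyond n * (size // n) in either direction are dropped.
    """
    rows, cols = len(image), len(image[0])
    h, w = rows // n, cols // n
    tiles = []
    for i in range(n):
        for j in range(n):
            tiles.append([row[j * w:(j + 1) * w]
                          for row in image[i * h:(i + 1) * h]])
    return tiles


def train_test_tiles(tiles, test_index):
    """Hold out one tile for testing and keep the others for training."""
    test = [tiles[test_index]]
    train = [t for k, t in enumerate(tiles) if k != test_index]
    return train, test
```

With `n=3`, each image yields eight training sub-images and one test sub-image, matching the eight-to-one split described above.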

**Figure 2.** (**a**) The Brodatz image (D88) is used to create the training image (**b**) and test images (**c**). (**d**) The Vistex image (Food0007) is used to create the training images (**e**) and test images (**f**). (**g**) The Outex image (Canvas002) is used to create the training images (**h**) and test images (**i**).

For the Vistex database, the number of training and testing images was eight as in Reference [4]. In the case of the Outex database, the number of training and testing sub-images was 10 as in References [4,9,47]. The KTH-TIPS2b database was already partitioned into four samples, and we performed a cross-validation as in References [7,48].

In each case, we subdivided the training database and the test database using two parameters: n is the number of sub-images per side, and r is the number of times we take n<sup>2</sup> local images from each sample. We can therefore take r × n<sup>2</sup> local images to extract features from all the samples in each database. Figure 3 shows the image subdivision scheme for the training images. It can be observed that the partitioning scheme allows obtaining features from different parts of the image, at a global and local level. The method was designed to follow this approach so that no relevant information would be lost from the image.

**Figure 3.** (**a**) A sample image from Vistex (Food0007) and the subdivision process. (**b**,**e**) Image is divided into n<sup>2</sup> images. (**c**,**f**) Image is divided into n<sup>2</sup> random blocks. (**d**,**g**) Image is divided into r × n<sup>2</sup> random blocks. In this example, (**b**–**d**) n = 2, (**e**–**g**) n = 3, and r = 2 (**d**,**g**).
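The subdivision controlled by n and r can be illustrated with a minimal sketch. It only computes block coordinates and makes several assumptions that are not taken from the paper: the function names `grid_blocks` and `random_blocks` are hypothetical, blocks have side length equal to the image side divided by n (integer division), and random positions are drawn uniformly with a fixed seed.

```python
import random


def grid_blocks(height, width, n):
    """Return n*n non-overlapping blocks (top, left, h, w) on a regular grid."""
    bh, bw = height // n, width // n
    return [(row * bh, col * bw, bh, bw) for row in range(n) for col in range(n)]


def random_blocks(height, width, n, r, seed=0):
    """Return r*n*n blocks of the same size placed at random positions.

    Assumption: positions are uniform over all placements that keep the
    block fully inside the image; the paper does not state the sampling rule.
    """
    rng = random.Random(seed)
    bh, bw = height // n, width // n
    blocks = []
    for _ in range(r * n * n):
        top = rng.randint(0, height - bh)    # randint is inclusive at both ends
        left = rng.randint(0, width - bw)
        blocks.append((top, left, bh, bw))
    return blocks
```

For a 213 × 213 training sub-image with n = 3 and r = 3, `random_blocks(213, 213, 3, 3)` yields 27 blocks of 71 × 71 pixels.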

BQMP and Haralick (without using co-occurrence matrices) are invariant to translation and rotation; the same features are obtained if an exchange of the position of two pixels is made in the image [3,6,47]. This suggests that there is spatial information present in the image that is not extracted by these features. In our proposed method, we use the two-scale scheme, local and global, to add spatial information to the extracted features. This is shown in Figure 4, and explained in detail in Section 2.4. The BQMP and Haralick features are extracted in each quadrant. The test images can be subdivided into local images from which the features are extracted. This allows the creation of a post-processing stage in which a bagging process can be performed. Our method is invariant to translation because of the randomness of the local image positions, but it is partially invariant to rotation because the features are concatenated in an established order. However, color features, as well as Haralick texture features, are invariant to rotation. There are problems where orientation dependency is desirable [49].

**Figure 4.** Example of (**a**) the original image. (**b**) The four local partitions and (**c**) the global partition. These five images generate two spatial scales: four local (**b**) and one global, adding spatial information to the extracted features.

#### *2.2. BQMP Color Feature Extraction*

After image subdivision, the BQMP was applied to each one of the local and global sub-images. This method is used as a tool for extracting color features using quaternions. Each pixel in the RGB space is represented as a quaternion. In Reference [6], the authors showed that it is possible to obtain two quaternions that represent a different part of the image, obtaining a binarization of the image in the case of two colors. For more colors, this method can be performed recursively n times, yielding 2<sup>n</sup> representatives of an image, and the part of the image that each representative represents in a histogram. The process is repeated, obtaining a result with a binary tree structure. Figure 5 shows the BQMP method for the case of four different colors. The numbers show the color code and the number of pixels represented by each. Figure 6 shows the second iteration for the same case.

The feature vector is formed by concatenating the color code and histograms, through normalization. This vector is concatenated with the ones in other scales and the ones made using Haralick statistics, which are also normalized.

**Figure 5.** First binary quaternion-moment-preserving (BQMP) iteration example for the case of four different colors. Numbers show the color coded in (q<sub>0</sub>, q<sub>1</sub>, q<sub>2</sub>, q<sub>3</sub>) and the number of pixels for each color representative (histogram).

**Figure 6.** Second BQMP iteration example for the case of four different colors. Numbers show the color coded in (q<sub>0</sub>, q<sub>1</sub>, q<sub>2</sub>, q<sub>3</sub>) and the number of pixels for each color representative (histogram).

#### *2.3. Haralick Texture Features*

The Haralick texture features are angular second moment, contrast, correlation, sum of squares, inverse difference moment, sum average, sum variance, sum entropy, entropy, difference variance, difference entropy, and information measures of correlation [3]. Other measures characterize the complexity and nature of tone transitions in each channel of the image. The usual practice is to use the first 13 Haralick features with a co-occurrence matrix [4], but in this work, we extracted the 13 Haralick features directly from the images because they provide spatial information from different regions within each image.

The Haralick features used as texture features are given in Equations (1)–(13); Equations (14)–(21) explain the notation employed.

Angular second moment:

$$f_1 = \sum_{i} \sum_{j} p(i, j)^2. \tag{1}$$

Contrast:

$$f_2 = \sum_{n=0}^{N-1} n^2 \sum_{i=1}^{N} \sum_{j=1}^{N} p(i, j), \tag{2}$$

where |*i* − *j*| = *n*.

Correlation:

$$f_3 = \frac{\sum_{i} \sum_{j} (ij)\,p(i,j) - \mu_x \mu_y}{\sigma_x \sigma_y}, \tag{3}$$

where µ<sub>x</sub>, µ<sub>y</sub>, σ<sub>x</sub>, and σ<sub>y</sub> are the means and standard deviations of *p<sub>x</sub>* and *p<sub>y</sub>*.

Sum of squares:

$$f_4 = \sum_{i} \sum_{j} (i - \mu)^2 p(i, j). \tag{4}$$

Inverse difference moment:

$$f_5 = \sum_{i} \sum_{j} \frac{p(i, j)}{1 + (i - j)^2}. \tag{5}$$

Sum average:

$$f_6 = \sum_{i=2}^{2N} i\,p_{x+y}(i). \tag{6}$$

Sum variance:

$$f_7 = \sum_{i=2}^{2N} (i - f_8)^2 p_{x+y}(i). \tag{7}$$

Sum entropy:

$$f_8 = -\sum_{i=2}^{2N} p_{x+y}(i) \log\left(p_{x+y}(i)\right). \tag{8}$$

Entropy:

$$f_9 = -\sum_{i}\sum_{j} p(i,j)\log\left(p(i,j)\right). \tag{9}$$

Difference variance:

$$f_{10} = \mathrm{Var}\left(p_{x-y}\right). \tag{10}$$

Difference entropy:

$$f_{11} = -\sum_{i=0}^{N-1} p_{x-y}(i) \log \left( p_{x-y}(i) \right). \tag{11}$$

Information measures of correlation:

$$f_{12} = \frac{f_9 - HXY1}{\max(HX, HY)}, \tag{12}$$

$$f_{13} = \left(1 - \exp\left[-2(HXY2 - f_9)\right]\right)^{\frac{1}{2}}, \tag{13}$$

where *HX* and *HY* are the entropies of *p<sub>x</sub>* and *p<sub>y</sub>*, and

$$HXY = -\sum_{i} \sum_{j} p(i, j) \log\left(p(i, j)\right), \tag{14}$$

$$HXY1 = -\sum_{i} \sum_{j} p(i, j) \log \left( p_x(i) p_y(j) \right), \text{ and} \tag{15}$$

$$HXY2 = -\sum_{i} \sum_{j} p_x(i) p_y(j) \log \left( p_x(i) p_y(j) \right). \tag{16}$$

In Equations (1)–(13), to calculate the features, the following notation was used:

$$p(i,j), \tag{17}$$

which is the (*i*,*j*)th pixel in a gray sub-image matrix.

$$p_x(i), \tag{18}$$

which is the *i*th entry in the sub-image matrix, obtained by summing the rows of *p*.

$$p_y(j), \tag{19}$$

which is the *j*th entry in the sub-image matrix, obtained by summing the columns of *p*.

$$p_{x+y}(k) = \sum_{i=1}^{N} \sum_{j=1}^{N} p(i,j), \tag{20}$$

where *i* + *j* = *k* and *k* = 2, 3, . . . , 2N.

$$p_{x-y}(k) = \sum_{i=1}^{N} \sum_{j=1}^{N} p(i,j), \tag{21}$$

where |*i* − *j*| = *k* and *k* = 0, 1, . . . , N − 1.
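As an illustration of how such statistics can be evaluated directly on a gray sub-image, the following minimal sketch computes four of them: Equations (1), (2), (5), and (9). It is not the authors' implementation. The function name `haralick_subset` is hypothetical, and the sketch assumes that the pixel matrix is first normalized so that its entries sum to one, which the text does not state explicitly.

```python
import math


def haralick_subset(sub_image):
    """Compute f1, f2, f5 and f9 for a 2-D list of non-negative gray levels.

    Assumption: entries are normalised to sum to 1 before use, so that
    p(i, j) behaves like a probability in the entropy term.
    """
    total = float(sum(sum(row) for row in sub_image))
    if total <= 0:
        raise ValueError("sub-image must contain at least one positive value")
    f1 = f2 = f5 = f9 = 0.0
    for i, row in enumerate(sub_image, start=1):       # 1-based indices as in the text
        for j, value in enumerate(row, start=1):
            p = value / total
            d2 = (i - j) ** 2
            f1 += p * p                  # Equation (1): angular second moment
            f2 += d2 * p                 # Equation (2): contrast, grouped by |i - j| = n
            f5 += p / (1.0 + d2)         # Equation (5): inverse difference moment
            if p > 0.0:                  # skip zeros, where p * log(p) tends to 0
                f9 -= p * math.log(p)    # Equation (9): entropy
    return {"f1": f1, "f2": f2, "f5": f5, "f9": f9}
```

For a uniform 2 × 2 block every normalized entry is 0.25, so the function returns f1 = 0.25, f2 = 0.5, f5 = 0.75, and f9 = log 4.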

#### *2.4. Feature Extraction*

The feature extraction is performed using local and global scales. The feature vector is created by extracting features from each different image partition (local and global). An example is shown in Figure 7. The original image (a) is divided into five partitions, (b) four local and one global (the same original image), as shown in Figure 4. The feature vector is obtained from each partition as shown in (c). Since BQMP is applied once, we obtain different values for two representative colors for each sub-image. As in the example shown in Figure 7c, the first representative color is brown with R1 = 78, G1 = 62, and B1 = 39. The other color is pink with R2 = 215, G2 = 80, B2 = 119. Through binarization, we know that brown represents 41% of the image and pink 59%; therefore, H1 = 0.41 and H2 = 0.59.


To achieve the binarization, the three-dimensional RGB information was transformed into a four-dimensional quaternion. Those quaternions were used to obtain the moments of order 1, 2, and 3, and the moments were used to set up the moment-preserving equations, whose solution gives the representative colors (R1,G1,B1 and R2,G2,B2) and the representative histograms (H1 and H2), as Reference [6] described. The moments were computed using quaternion multiplication, which is a four-dimensional operation. For example, in the case of two quaternions *a* = *a*<sub>1</sub> + *a*<sub>2</sub>*i* + *a*<sub>3</sub>*j* + *a*<sub>4</sub>*k* and *b* = *b*<sub>1</sub> + *b*<sub>2</sub>*i* + *b*<sub>3</sub>*j* + *b*<sub>4</sub>*k*, the product will be equal to *ab* = (*a*<sub>1</sub>*b*<sub>1</sub> − *a*<sub>2</sub>*b*<sub>2</sub> − *a*<sub>3</sub>*b*<sub>3</sub> − *a*<sub>4</sub>*b*<sub>4</sub>) + (*a*<sub>1</sub>*b*<sub>2</sub> + *a*<sub>2</sub>*b*<sub>1</sub> + *a*<sub>3</sub>*b*<sub>4</sub> − *a*<sub>4</sub>*b*<sub>3</sub>)*i* + (*a*<sub>1</sub>*b*<sub>3</sub> − *a*<sub>2</sub>*b*<sub>4</sub> + *a*<sub>3</sub>*b*<sub>1</sub> + *a*<sub>4</sub>*b*<sub>2</sub>)*j* + (*a*<sub>1</sub>*b*<sub>4</sub> + *a*<sub>2</sub>*b*<sub>3</sub> − *a*<sub>3</sub>*b*<sub>2</sub> + *a*<sub>4</sub>*b*<sub>1</sub>)*k*. Therefore, even if *a*<sub>1</sub> and *b*<sub>1</sub> are equal to zero, the real part of the multiplication will not necessarily be zero. In the case of the example, this extra information is Q1 = −0.61 in the first color and Q2 = 0.43 in the second color.
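The quaternion product described above can be checked numerically with a short sketch. The helper names `quat_mul` and `rgb_to_quat` are hypothetical, and mapping an RGB pixel to a quaternion with zero real part is an assumption suggested by, but not spelled out in, the text.

```python
def quat_mul(a, b):
    """Hamilton product of quaternions given as (real, i, j, k) tuples."""
    a1, a2, a3, a4 = a
    b1, b2, b3, b4 = b
    return (
        a1 * b1 - a2 * b2 - a3 * b3 - a4 * b4,   # real part
        a1 * b2 + a2 * b1 + a3 * b4 - a4 * b3,   # i component
        a1 * b3 - a2 * b4 + a3 * b1 + a4 * b2,   # j component
        a1 * b4 + a2 * b3 - a3 * b2 + a4 * b1,   # k component
    )


def rgb_to_quat(r, g, b):
    """Assumption: an RGB pixel becomes a quaternion with zero real part."""
    return (0.0, float(r), float(g), float(b))


brown = rgb_to_quat(78, 62, 39)
squared = quat_mul(brown, brown)
```

Here `squared[0]` equals −(78² + 62² + 39²) = −11449, so the real part is non-zero even though both factors have a zero real part.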

Then, we extracted the 13 Haralick features (explained in Section 2.3) in each sub-image and each color channel, obtaining 13 × 5 × 3 = 195 more features to concatenate into the final vector.

In the texture of Figure 7, we performed only one BQMP iteration in an image from the Brodatz database, obtaining only two color representatives. In general, the BQMP method generates 2<sup>n</sup> representatives from each image, when n iterations are used. In Figures 5 and 6, an example for a simple color pattern shows the representatives for two iterations, n = 2. In our preliminary experiments, the results did not improve significantly for n ≥ 3, and computational time increased significantly. The feature extraction was performed in local and global scales, so that the representative colors would capture the diversity of the whole image in a local and global manner.

In the case of more complex textures, it is possible to use more iterations of the BQMP method obtaining more representative colors, histograms, and quaternions. Figure 8 shows the feature extraction from one sample image from the Vistex database (Food0007) using one, two, or three iterations of the method.
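The size of the resulting feature vector can be estimated with a small arithmetic sketch. The texture count 13 × 5 × 3 = 195 comes from the text. The color count rests on an assumption that is only suggested by the examples: each BQMP representative is taken to contribute five values (R, G, B, the real part Q, and the histogram share H). The function name `feature_vector_length` is hypothetical.

```python
def feature_vector_length(bqmp_iterations=1, partitions=5, channels=3, haralick=13):
    """Return (color, texture, total) feature counts under the stated assumptions."""
    representatives = 2 ** bqmp_iterations       # 2^n representatives per partition
    values_per_representative = 5                # assumption: R, G, B, Q and H
    color = partitions * representatives * values_per_representative
    texture = haralick * partitions * channels   # 13 x 5 x 3 = 195 from the text
    return color, texture, color + texture
```

Under these assumptions, one BQMP iteration gives 50 color features and 195 texture features, and each additional iteration doubles the color part only.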

**Figure 7.** Feature vector extracted from one image from the Brodatz Database. (**a**) The original image is divided into five partitions (**b**): four local partitions and one global (the same image, bottom). (**c**) The feature vectors obtained from each partition. The color feature vector and texture feature vector are concatenated.

As in the example shown in Figure 8c, the first representative color is dark blue with R1 = 30, G1 = 32, and B1 = 69. The other color is cream with R2 = 217, G2 = 172, and B2 = 106. Through binarization, we know that dark blue represents 74% of the image and cream 26%; therefore, H1 = 0.74 and H2 = 0.26. Q1 and Q2 are 1.49 and −2.43, respectively. The Haralick features computed from Equations (1)–(13) for the first sub-image are the following: F1 = 1.09 × 10<sup>−4</sup>, F2 = 2.66 × 10<sup>3</sup>, F3 = 9.90 × 10<sup>8</sup>, F4 = 4.75 × 10<sup>3</sup>, F5 = 2.60 × 10<sup>−2</sup>, F6 = 1.24 × 10<sup>2</sup>, F7 = 1.65 × 10<sup>4</sup>, F8 = 5.28, F9 = 9.32, F10 = 2.43 × 10<sup>−5</sup>, F11 = 4.63, F12 = −6.37 × 10<sup>−2</sup>, and F13 = 6.78 × 10<sup>−1</sup>.

**Figure 8.** Feature vector extracted from one image of the Vistex Database (Food0007). (**a**) The original image is divided into five partitions (**b**): four local partitions and one global (the same image, bottom). (**c**) The feature vectors obtained from each partition. The color feature vectors and texture feature vectors are concatenated. (**d**–**g**) The global sub-image as a real example using one iteration (**e**), two iterations (**f**), or three iterations (**g**) of the BQMP method.

#### *2.5. SVM Classifier and Post-Processing*

After features were extracted from each image, an SVM classifier was used to determine each texture class. The SVM became very popular within the machine learning community due to its great classification potential [41,42]. The SVM maps input vectors in a non-linear transformation to a high-dimensional space where a linear decision hyperplane is constructed for class separation.

A Gaussian SVM kernel was used, and a coarse exhaustive search over the remaining SVM parameters was performed to find the optimal configuration on the training set.

A grid search with cross-validation was used to find the best parameters for the multiclass SVM cascade. We used half of the training set to determine the SVM parameters, and the other half in validation. For testing, we used a different set as explained in Section 3.1. In the case of bagging, we took repeated samples from the original training set for balancing the class distributions to generate new balanced datasets. Two parameters were tuned: the number of decision trees voting in the ensemble, and the complexity parameter related to the size of the decision tree. The method was trained for texture classification using the training sets as they are specified for each database.

In order to have a fair comparison between our obtained classification rates and those previously published, we employed the same partitions used for training and testing as in Diaz-Pernas et al., 2011 [1], Khan et al., 2015 [7], Arvis et al., 2004 [9], Mäenpää et al., 2004 [27], Qazi et al., 2011 [28], Losson et al., 2013 [4], and Couto et al., 2017 [50]. The training and test sets came from separate sub-images, and the methods never used the same sub-image for both training and testing.

In general, combining multiple classification models increases predictive performance [51]. In the post-processing stage, a bagging predictive model composed of a weighted combination of weak classifiers was applied to the results of the SVM model [52]. Bagging is a technique that uses bootstrap sampling to reduce the variance and improve the accuracy of a predictor [51]. It may be used in classification and regression. We created a bagging ensemble for classification using deep trees as weak learners. The bagging predictor was trained with new images taken from the training set of each database. The output of this ensemble was assigned as the final classification for each image.
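The two generic ingredients of bagging, bootstrap sampling and vote aggregation, can be sketched as follows. This is not the authors' ensemble: no decision trees are trained here, the votes are unweighted, and the function names `bootstrap_indices` and `majority_vote` are hypothetical.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap replicate)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(labels):
    """Return the most frequent label among the ensemble's predictions."""
    return Counter(labels).most_common(1)[0][0]


rng = random.Random(0)                       # fixed seed, for repeatability only
replicate = bootstrap_indices(10, rng)       # indices used to train one weak learner
final_label = majority_vote(["wood", "wood", "brick"])
```

In a full ensemble, each weak learner would be trained on one such replicate and the per-learner predictions for an image would be combined by the vote.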

We compared our results with those published previously on the same databases.

#### *2.6. Databases*

It is important to validate the method on standard colored texture databases with previously published results [53]. Therefore, we chose four colored texture databases that were used recently for this purpose: the Colored Brodatz Texture (CBT) [20], Vistex [4], Outex [47], and KTH-TIPS2b [48] databases.

The Brodatz Colored Texture (CBT) database has 111 images of 640 × 640 pixels. Each image in the database has a different texture. The Vistex Database was developed at Massachusetts Institute of Technology (MIT). It has 54 images of 512 × 512 pixels. Each image in the database has a natural color texture. The Outex Database was developed at the University of Oulu, Finland. We used 68 color texture images of 746 × 538, to obtain 1360 images of 128 × 128 with 68 different textures. Each image in the database has a natural color texture. Finally, the KTH-TIPS2b database contains images of 200 × 200 pixels. It has four samples of 108 images of 11 materials at different scales. Each image in the database has a natural color texture.

#### **3. Results**

#### *3.1. Experiments*

In order to have a fair comparison between our obtained classification rates and those previously published, we used the same databases and partitions used for training and testing as in Diaz-Pernas et al., 2011 [1], Khan et al., 2015 [7], Arvis et al., 2004 [9], Mäenpää et al., 2004 [27], Qazi et al., 2011 [28], Losson et al., 2013 [4], and Couto et al., 2017 [50]. The training and test sets came from separate sub-images, and the methods never used the same sub-image for both training and testing.

#### 3.1.1. Brodatz Database

The methodology, as in Reference [1], used four different sets of images from the same database: the first one consisted of 10 images, the second of 30 images, the third of 40 images, and the fourth of all 111 images. The classification results in previously published articles reached 91.03% in Reference [1], 89.71% in Reference [43], and 88.15% in Reference [44] for the fourth set of the Brodatz database. We used the same partition used for training and testing as in Diaz-Pernas et al., 2011 [1]. The Brodatz image database consists of 111 images of size 640 × 640 pixels. Partitioning each image into nine non-overlapping sub-images of 213 × 213 pixels, we obtained 999 sub-images from 111 texture classes. Diaz-Pernas et al., 2011 [1], using the (2 × 2) center sub-image as training and the other to test, reached the best classification results (see Figure 2).

In training, each training sub-image with a size of 213 × 213 pixels was subdivided into a number n of images. Features were extracted from each subdivided image. The parameter n changed from 2 to 7 in the first three experiments, and from 2 to 9 in the last one. Once the feature vector was computed, each vector was assigned to a texture class using an SVM as a classifier.

#### 3.1.2. Vistex Database

The methodology used 54 images that were subdivided into 864 sub-images; 432 were used in training and the other 432 as testing images, as in Arvis et al., 2004 [9], Mäenpää et al., 2004 [27], Qazi et al., 2011 [28], Losson et al., 2013 [4], and Couto et al., 2017 [50]. Previously published results reached 98.61% and 99.07% [4] on the same database using color filter array (CFA) chromatic co-occurrence matrices (CCM). We chose the same 54 texture images to be able to compare our results with those previously published.

#### 3.1.3. Outex Database

The methodology used 68 images that were subdivided into 1360 sub-images; 680 were used in training and the other 680 as testing images, as in Arvis et al., 2004 [9]. The same partition was performed by Mäenpää et al., 2004 [27], Qazi et al., 2011 [28], Losson et al., 2013 [4], and Couto et al., 2017 [50]. Previously published results reached 94.85% and 94.41% [4] on the same database using CFA chromatic co-occurrence matrices. We chose the same 68 texture images and partition to be able to compare our results with those previously published.
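The sub-image counts quoted for Outex and Vistex can be checked with a short calculation. The sketch below assumes non-overlapping square tiles; the 128 × 128 tile size is stated in the text for Outex, whereas for Vistex it is only inferred from 864/54 = 16 sub-images per 512 × 512 image. The helper name `tile_count` is hypothetical.

```python
def tile_count(width, height, tile):
    """Number of non-overlapping tile x tile sub-images that fit in an image."""
    return (width // tile) * (height // tile)

# Outex: 68 images of 746 x 538 pixels, tiles of 128 x 128 pixels.
outex_per_image = tile_count(746, 538, 128)    # 5 columns x 4 rows
outex_total = 68 * outex_per_image

# Vistex: 54 images of 512 x 512 pixels; the tile size is an inference.
vistex_per_image = tile_count(512, 512, 128)
vistex_total = 54 * vistex_per_image

# Half of the sub-images go to training and half to testing.
outex_train, vistex_train = outex_total // 2, vistex_total // 2
```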

#### 3.1.4. KTH-TIPS2b Database

The KTH-TIPS2b database consists of four sets of 1188 images. Each set has 11 different classes. The methodology used the four sets, each with 108 images for each of the 11 classes, making 1188 images per set. We followed the same protocol described in Reference [7], where the average classification performance was reported over four test runs. In each run, all images from one sample were used for training, while all the images from the remaining three samples were used for testing, as in Khan et al., 2015 [7]. Previously published results reached 70.6% [7] and 91.3% [8] on the same database using Divisive Information Theoretic Clustering (DITC) and three-dimensional adaptive sum and difference (3D-ASDH) methods, respectively.
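The train-on-one, test-on-three protocol can be written down as a small generator. This is only a sketch of the split logic: the sample labels in `SAMPLES` and the function names are hypothetical, and no classifier is involved.

```python
SAMPLES = ["a", "b", "c", "d"]   # hypothetical labels for the four samples

def kth_splits(samples):
    """Yield (train_sample, test_samples): train on one sample, test on the rest."""
    for s in samples:
        yield s, [t for t in samples if t != s]

def mean_accuracy(per_run_accuracy):
    """Average the accuracy obtained over the individual runs."""
    return sum(per_run_accuracy) / len(per_run_accuracy)

splits = list(kth_splits(SAMPLES))
images_per_sample = 108 * 11          # 108 images for each of 11 classes
train_size = images_per_sample
test_size = 3 * images_per_sample
```

Each of the four runs therefore trains on 1188 images and tests on 3564, and the reported figure is the mean over the four runs.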

#### *3.2. Results*

#### 3.2.1. Brodatz Database

Table 1 shows the classification results for the Brodatz database. In this experiment, we used the features extracted in the first set (10 images), varying the size of the images in training, with r × n<sup>2</sup> images per training sub-image. The last column shows the results of using the post-processing stage applied on the column with the best performance. The best result for the first experiment on the Brodatz database using 10 images was 100%. In this case, 27 (3 × 3<sup>2</sup>) images with size 71 × 71 (213/3) pixels were used for training. It can also be observed that the use of the post-processing step improved the results up to 100% in all cases. These results were higher than 98.23%, the best result previously published for this experiment on the Brodatz database [1]. Table 1 also shows the classification results for the second, third, and fourth experiments on the Brodatz database with 30, 40, and 111 images, respectively. The best result reached using 30 images was 99.84%. In this case, 64 (4 × 4<sup>2</sup>) random images with size 53 × 53 (213/4) pixels were used for training. It can also be observed that the use of the post-processing step improved the results up to 100% in all cases. These results are higher than the best result, 97.54%, previously published for this experiment on the Brodatz database [1].


**Table 1.** Classification results of the experiments on the Brodatz database for sets of 10, 30, 40, and 111 images. The best results reached with and without post-processing are highlighted by bold text.

<sup>1</sup> n is the number of images per side in the training stage; r is the number of times the method was iterated on each image.

The best result reached using 40 images was 99.71%. In this case, 100 (4 × 5<sup>2</sup>) random images of size 42 × 42 (213/5) pixels were used for training. It can also be observed that the use of the post-processing step improved the results up to 100% in all cases. These results are higher than 95.5%, the best previously published result for this experiment on the Brodatz database [1]. The best result reached using all 111 images was 97.63%. In this case, 196 (4 × 7<sup>2</sup>) random images of size 30 × 30 (213/7) pixels were used for training. It can also be observed that the use of the post-processing step improved the results up to 99.88%. These results are higher than the best results of 98.25% and 99.5% previously published for this experiment on the Brodatz database [30,50].

The size of the smaller images reached an optimum for n = 7 with an image size of 30 × 30. We performed an exhaustive search varying from n = 2 to n = 9, reaching an optimum at n = 7. An explanation is that n = 7 is optimal for the complete method using texture and color features. Table 2 compares the results previously published in the literature and our results on the Brodatz database for the four experiments which included 10, 30, 40, and 111 images, respectively.


**Table 2.** Best results of global–local Haralick–binary quaternion-moment-preserving (BQMP) classification for the Brodatz database for the four sets of 10, 30, 40, and 111 images compared to previously published studies. SVM: support vector machine.

<sup>1</sup> The last three methods are our results; <sup>2</sup> k-nearest neighbors.

It can be seen in Table 2 that our method, with post-processing, reached the highest results. The most significant improvement was reached on the complete Brodatz database that includes 111 images.

#### 3.2.2. Vistex Database

Table 3 shows the classification results of our method applied to the Vistex database (54 images). Each image in the training set was partitioned randomly, and the number of windows per side was varied from two to four in each image chosen for training, with the number of random images from 4 × n<sup>2</sup> to 10 × n<sup>2</sup>. The first column shows the best results reached by our method, and the second column shows the results after the post-processing stage. Table 4 compares the results published previously in the literature and our results for the Vistex database with 54 images. It can be observed in Table 4 that our post-processed method reached the highest results with 100%.


**Table 3.** Classification results of the experiment on the Vistex database for the set of 54 images. The best results reached with and without post-processing are highlighted by bold text.

<sup>1</sup> n is the number of images per side in the training stage; r is the number of times the method was iterated on each image.

**Table 4.** Best results of global–local Haralick–BQMP classification for the Vistex database with 54 images and best results published previously on the same database.


<sup>1</sup> The last three methods are our results.

#### 3.2.3. Outex Database

Table 5 shows the classification results of our method applied to the Outex database (68 images). Each image in the training set was partitioned randomly, varying the number of windows per side from two to four in each training image, and the number of random images from 5 × n<sup>2</sup> to 18 × n<sup>2</sup>. The first column shows the best results reached by our method, and the second column shows the results after the post-processing stage. The best results are highlighted by bold text.


**Table 5.** Classification results of the experiment on the Outex database for the set of 68 images. The best results reached with and without post-processing are highlighted by bold text.

<sup>1</sup> n is the number of images per side in the training stage; r is the number of times the method was iterated on each image.

Table 6 compares the results published previously in the literature and our results for the Outex database with 68 images. It can be observed in Table 6 that our post-processed method reached the highest results with 98.97%.

**Table 6.** Best results of global–local Haralick–BQMP classification for the Outex database with 68 images. The best previously published results are compared to our results.


<sup>1</sup> The last three methods are our results.

#### 3.2.4. KTH-TIPS2b Database

Table 7 shows the classification results of our method applied to the KTH-TIPS2b database (1188 × 4 images). In each test, all the images from one sample were used for training, while the images from the remaining three samples were used as a test set. The first column shows the best results reached by our method, and the second column shows the results after the post-processing stage.


**Table 7.** Classification results of the experiment on the KTH-TIPS2b database for the four sets of 1188 images.

<sup>1</sup> n is the number of images per side in the training stage; S is the set used for training, using the other three sets as test.

Table 8 compares the results published previously in the literature and our results for the KTH-TIPS2b database with 1188 × 4 images. It can be observed in Table 8 that our post-processed method reached the highest results with 95.75%.

**Table 8.** Best results of global–local Haralick–BQMP classification for the KTH-TIPS2b database with 1188 images per set and the best previously published results on the same database.


<sup>1</sup> The last three methods are in this paper.

#### 3.2.5. Color or Texture vs. Color and Texture

Table 9 compares the results of color features, texture features, and the combination of both for texture classification measured on the Brodatz, Vistex, Outex, and KTH-TIPS2b databases. It can be observed that both types of features, color and texture, contribute to the overall results, with maximum performance when both types of features are combined. Comparing these results to those previously published on the same databases, it can be observed that, although the method in Reference [28] reached 100% on the Vistex database, its results on Outex were only 94.5%.

**Table 9.** Best results of global–local Haralick–BQMP classification for all the databases with the contribution of each part of the model (color and spatial structure).


#### 3.2.6. Computation Time

Table 10 displays the computational time required for feature extraction (FE), classification time with the SVM, and post-processing (PP) time performed on the Vistex database. All implementations were carried out using Python 3 on an Intel (R) Core (TM) i7-7700HM 3.6 GHz, with 64 GB of random-access memory (RAM).

**Table 10.** Computational time of the proposed method on feature extraction (FE), SVM classification time, postprocessing (PP) time and total time. Experiments were conducted on the Vistex database (54 images).


#### **4. Discussion**

The idea of combining color and texture was proposed previously, but the proposed feature extraction process allows the method to preserve the information available in the original image, yielding significantly better results than those previously published. A possible drawback of previous texture classification methods is that important information is lost from the original image with the feature extraction method, hampering its ability to improve texture classification results. The feature extraction process that includes global and local features is something new from the point of view of combining color with texture. Sub-dividing the training partition of the database into sub-images, and trying to obtain all the information present in the image using various image sizes or a different number of images is something that was not reported in previous publications.

Color and texture features are extracted in order to classify complex colored textures. However, the feature extraction process loses part of the information present in the image because the two-dimensional (2D) information is transformed into a reduced space. By using global and local features extracted from many different partitions of the image, the information needed for colored texture classification is preserved better. Sub-dividing the training data into sub-images (local–global) and trying to obtain all the information present using different image sizes is a new approach.

Although the BQMP method was proposed several years ago [6], it was used in color data compression, color edge detection, and multiclass clustering of color data. The reduction of an image into representative colors and a histogram that indicates which part of the image is represented by these colors achieves excellent results. In addition, local and global features are extracted from each image. The results of our method were compared with those of several other feature extraction implementations on the Brodatz database with those published in References [1,27,28,31,43,44,50], on the Vistex database with those published in References [4,9,13,23,27,28,30,38,45,46,50], on the Outex database with those published in References [4,8,9,11,27,28,30,38,45,46,50], and on the KTH-TIPS2b with those published in References [7,8,31,34,35] (please see Tables 2, 4, 6 and 8). The proposed method generated better results than those that were published previously.

The Brodatz, Vistex, Outex, and KTH-TIPS2b databases are available for comparing the results of different texture classification methods. Tests should be performed under the same conditions. We compared our results with those of References [1,7] under the same conditions using the same training/test distribution, and an SVM as a classifier. We also compared our results with those of Reference [4], which used a nearest-neighbor classifier (KNN); therefore, we also tested our method with KNN instead of SVM. The results with KNN are shown in Tables 2, 4, 6 and 8, corroborating that SVM achieves better results. The proposed method achieved better results than those previously published.

#### **5. Conclusions**

In this paper, we presented a new method for classifying complex patterns of colored textures. Our proposed method includes four main steps. Firstly, the image is divided into local and global images. This image sub-division allows feature extraction in different spatial scales, as well as adding spatial information to the extracted features. Therefore, we capture global and local features from the texture. Secondly, texture and color features are extracted from each divided image using the BQMP and Haralick algorithms. Thirdly, a support vector machine is used to classify each image with the extracted features as inputs. Fourthly, a post-processing stage using bagging is employed.

The method was tested on four databases, the Brodatz, VisTex, Outex, and KTH-TIPS2b databases, yielding correct classification rates of 97.63%, 97.13%, 90.78%, and 92.90%, respectively. The post-processing stage improved the results to 99.88%, 100%, 98.97%, and 95.75%, respectively, for the same databases. We compared our results on the same databases to the best previously published results, finding significant improvements of 8.85%, 0.93% (to 100%), 4.12%, and 4.45%.
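The quoted improvements appear to be percentage-point differences between the post-processed results and the previous best figures cited in Section 3.1 (91.03% [1], 99.07% [4], 94.85% [4], and 91.3% [8]). A quick check, under that assumption about which baselines were used:

```python
# (previous best, proposed method with post-processing), as quoted in the text
results = {
    "Brodatz": (91.03, 99.88),
    "VisTex": (99.07, 100.00),
    "Outex": (94.85, 98.97),
    "KTH-TIPS2b": (91.30, 95.75),
}

# Difference in percentage points, rounded to two decimals.
gains = {name: round(ours - prev, 2) for name, (prev, ours) in results.items()}
```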

The partition of the image into local and global images provides information about features at different scales and spatial locations within each image, which is useful in color/texture classification. The above, combined with the use of a post-processing stage using a bagging predictive model, allows achieving such results.

**Author Contributions:** Conceptualization, C.F.N. and C.A.P.; methodology, C.F.N. and C.A.P.; software, C.F.N.; validation, C.F.N. and C.A.P.; formal analysis, C.F.N. and C.A.P.; investigation, C.F.N. and C.A.P.; resources, C.A.P.; data curation, C.F.N.; Writing—Original Draft preparation, C.F.N. and C.A.P.; Writing—Review and Editing, C.F.N. and C.A.P.; visualization, C.F.N.; supervision, C.A.P.; project administration, C.A.P.; funding acquisition, C.A.P.

**Funding:** This research was funded by FONDECYT, grant number 1191610 from CONICYT, the Department of Electrical Engineering, and the Advanced Mining Technology Center, Universidad de Chile.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Partial Order Rank Features in Colour Space**

#### **Fabrizio Smeraldi <sup>1,2,</sup>\*, Francesco Bianconi <sup>2</sup>, Antonio Fernández <sup>3</sup> and Elena González <sup>3</sup>**


Received: 26 November 2019; Accepted: 31 December 2019; Published: 10 January 2020

**Abstract:** Partial orders are the natural mathematical structure for comparing multivariate data that, like colours, lack a natural order. We introduce a novel, general approach to defining rank features in colour spaces based on partial orders, and show that it is possible to generalise existing rank based descriptors by replacing the order relation over intensity values by suitable partial orders in colour space. In particular, we extend a classical descriptor (the Texture Spectrum) to work with partial orders. The effectiveness of the generalised descriptor is demonstrated through a set of image classification experiments on 10 datasets of colour texture images. The results show that the partial-order version in colour space outperforms the grey-scale classic descriptor while maintaining the same number of features.

**Keywords:** mathematics of colour and texture; hand-designed image descriptors; rank features; partial orders

#### **1. Introduction**

It is, at first sight, peculiar that one of the most robust tools for image description, namely rank features, has only seen limited application to colour images. The problem is, of course, that while they are very effective at dealing with noise, rank features run afoul of the main theoretical difficulty associated with colour spaces, namely the absence of a natural order.

In recent years there has been a revival of interest in ranking of colour pixels. Notably, Ledoux et al. [1] published an extensive comparative study in the use of total orders as rank features for texture recognition. However, interest has been keenest in the field of colour morphology, where several solutions have been proposed—from adaptive orders that work around the 'false colour problem' to the natural mathematical structure for ordering higher-dimensional sets—that is partial orders [2–6].

In partially-ordered sets we simply admit that there will be couples of elements incomparable to each other. Partial orders are therefore particularly suitable for dealing with colour spaces, where statements like "yellow is greater than green" make little or no sense at all.

The objective of this work is to introduce a novel category of rank features based on partial orders. In the remainder, after providing some background on partial orders (Section 2), we detail the ways in which rank features can be defined (Section 2.5) and extend a classical descriptor (the Texture Spectrum) to work with partial orders (Section 3.1). We demonstrate the feasibility of the method through a set of experiments on 10 datasets of colour texture images (Section 3.2) and show that partial orders in colour space can outperform grey-scale total ordering (Section 4).

#### **2. Background**

#### *2.1. Rank Features*

Rank features are a well established technique for dealing with noise in images, enforcing invariance to all sorts of contrast or illumination variations and sensor nonlinearities [7]. Because of their robustness, they were first developed in the context of wide-baseline stereo matching—see for instance the census and rank transforms [8]. More recently, descriptors in the popular Local Binary Pattern (LBP) family, including Texture Spectrum, Binary Gradient Contours, etc. [9,10] have turned rank features into a general purpose tool, with applications—among others—in texture classification, face recognition, surface inspection and content-based image retrieval [11]. The descriptive power of rank features has been expanded to explicitly capture orientation (Ranklets [12]) and second-order stimuli (Variance Ranklets [13]), all types of information that were seen as the preserve of linear filters or ad-hoc algorithms.

Common to all rank features is the fact that they are defined in terms of ordinal information between pixels only, with the actual pixel values being discarded. This can be done in terms of pairwise pixel comparisons (rank and census transform, LBP), pixel ranks (Ranklets) or a permutation of ranks (Variance Ranklets), but it is easy to see that the two approaches are equivalent [12] and that all descriptors rely on the natural order relation (≤) between pixel values.

Before proceeding to definitions it is worth noting that, notwithstanding the trend towards the use of convolutional neural networks as feature extractors [14], rank features are still competitive in texture applications [15]. In the following section we recall the axioms for an order relation.

#### *2.2. Order Relations*

An order relation is an abstraction of the common notion of "greater than" used to compare numerical values, in our case pixel values in 𝒫 (typically the set of 8-bit intensity values). In order to be called a (total) order, a binary relation ≤ needs to satisfy the following four conditions:

**Definition 1** (Order axioms)**.** *For all* (*x*, *y*, *z*) ∈ 𝒫<sup>3</sup>*,*

1. *x* ≤ *x* (reflexivity);
2. *if x* ≤ *y and y* ≤ *x, then x* = *y* (antisymmetry);
3. *if x* ≤ *y and y* ≤ *z, then x* ≤ *z* (transitivity);
4. *x* ≤ *y or y* ≤ *x* (totality).

The last condition guarantees that we know how to compare any pair of pixel values.
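As an illustration, the axioms can be verified by brute force on a small finite set. The sketch below assumes the standard four axioms of a total order (reflexivity, antisymmetry, transitivity and totality); the function name `is_total_order` is hypothetical.

```python
from itertools import product

def is_total_order(elements, leq):
    """Check the four order axioms by exhaustive enumeration on a finite set."""
    elems = list(elements)
    reflexive = all(leq(x, x) for x in elems)
    antisymmetric = all(not (leq(x, y) and leq(y, x)) or x == y
                        for x, y in product(elems, repeat=2))
    transitive = all(not (leq(x, y) and leq(y, z)) or leq(x, z)
                     for x, y, z in product(elems, repeat=3))
    total = all(leq(x, y) or leq(y, x)
                for x, y in product(elems, repeat=2))
    return reflexive and antisymmetric and transitive and total

# The natural order on a few intensity values satisfies all four axioms.
natural = is_total_order(range(16), lambda a, b: a <= b)

# Divisibility on {1, ..., 8} fails the last axiom: 2 and 3 are incomparable.
divisibility = is_total_order(range(1, 9), lambda a, b: b % a == 0)
```

The second relation satisfies the first three axioms but not the fourth, which is exactly the situation of a partial order.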

#### *2.3. Ordering High-Dimensional Data*

The application of rank features to multi-channel images or higher dimensional data is hindered by the fact that there is no natural way of ordering multivariate data. It is certainly possible to provide a total order for a colour space; for instance, one could order RGB data lexicographically using the R channel as the primary sorting key, followed by G and finally by B. However, like other similar options, this has a disadvantage: there are colours that are very close to each other in colour space but very far apart in the order, so it is of limited practical interest (a sub-relation of the lexicographical order, the *product order*, is indeed of practical interest and will be discussed in detail in this paper, see Section 2.5.1). In general, it is best to resort to some sort of sub-ordering principle. These can be broadly divided into four categories [16]:

- marginal ordering (M-ordering);
- reduced (aggregate) ordering (R-ordering);
- conditional ordering (C-ordering);
- partial ordering (P-ordering).

In marginal ordering, ranking is carried out on one or more components (*marginals*) of the multivariate data. Ranking colour data in the RGB space by the value of red is an example of M-ordering; lexicographical ordering is another one. Reduced (aggregate) ordering relies on converting multivariate data to univariate through suitable transformations. A common way to do this consists of establishing a reference point in the data space and using the distance from that point to rank the data. Conditional ordering occurs when we sort a random multivariate sample based on the corresponding (usually marginally-sorted) values of another sample. C-ordering is closely related to the concept of *concomitants* in Statistics [17]. Partial ordering will be discussed in detail in Section 2.5.

Interestingly, many of the common ways of dealing with order in colour space fall under the first two categories, i.e., marginal and reduced (aggregate) ordering. For instance, ranking based on intensity can be seen as a marginal ordering of the HSV space along the V axis; or as a reduced or aggregate ordering over the RGB space, where the aggregating function is the grey-level intensity. Other examples of aggregating orders will be given in Section 3.1.
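The first two categories are easy to illustrate on a handful of RGB triplets. The sketch below is only an example: the sample colours, the reference point and the helper `lex_rank` are hypothetical choices, and Euclidean distance is just one possible aggregating function.

```python
import math

colours = [(200, 10, 10), (199, 255, 255), (0, 0, 0), (200, 10, 11)]

# Marginal ordering on one component: rank by the red channel only.
by_red = sorted(colours, key=lambda c: c[0])

# Lexicographic ordering: R is the primary key, then G, then B.
lexicographic = sorted(colours)

# Reduced (aggregate) ordering: rank by distance from a reference point.
reference = (0, 0, 0)  # hypothetical reference colour
by_distance = sorted(colours, key=lambda c: math.dist(c, reference))

def lex_rank(colour):
    """Position of an 8-bit RGB colour in the full lexicographic order."""
    r, g, b = colour
    return (r * 256 + g) * 256 + b

# Close colours can be far apart in the order, and distant colours adjacent.
gap_close = lex_rank((101, 0, 0)) - lex_rank((100, 0, 0))
gap_far = lex_rank((101, 0, 0)) - lex_rank((100, 255, 255))
```

Here `gap_close` is 65536 for two colours one unit apart, while `gap_far` is 1 for two colours on opposite sides of the green and blue ranges, which is the drawback of the lexicographic order mentioned in Section 2.3.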

In this paper, we will focus on rank features based on *partial orders*. Before introducing these, we review recent approaches to using multivariate orders on colour images.

#### *2.4. Rank-Based Approaches to Colour Processing*

Previous approaches to rank-based colour features typically extend grey-scale rank-based methods to the colour domain by considering either the colour channels separately (*intra*-channel features) and/or in pairwise combination (*inter*-channel features). Mäenpää and Pietikäinen [18] for instance extended classic LBP by applying it both to each R, G and B colour channel separately and pairwise between each of the R–G, R–B and G–B pairs. Bianconi et al. [19] adopted the same approach for extending grey-scale ranklets [12] to the colour domain. Lee et al. [20] defined Local Colour Vector Binary Patterns (LCVBP) by decomposing the colour triplets into a norm and angular component and by computing LBP on each of them. More recently, Cusano et al. [21] introduced Local Angular Patterns (LAP) which consider the angular component only and discard the norm part altogether.

Another possible strategy consists of establishing some sort of *a priori* total ordering on the colour data. This approach is not uncommon in colour morphology (see for instance Angulo [4], van De Gronde and Roerdink [6]) and has been advocated for extending LBP to colour images by Barra [22]. Of late, this family of methods has been extensively investigated by Ledoux et al. [1] and Bello-Cerezo et al. [23]. The problem is that imposing a total ordering on the colour data inevitably entails a certain degree of arbitrariness, with the consequence that the results tend to be dataset-dependent. On the other hand, morphology for tensor-valued images (that arise from certain magnetic resonance techniques) has relied on the Loewner order, which is in fact a partial order (see for instance Burgeth et al. [24]; more on this in Section 2.5.2). More recently, this approach has been extended to morphology for colour images by Burgeth and Kleefeld [5]. Partial ordering circumvents the problem of ordering multivariate data totally, at the expense of not allowing comparisons between some colour values.

#### *2.5. Partial Orders*

A partial order differs from a total order in that the fourth axiom in Definition 1 is waived, i.e., there are pairs of elements in the set that are incomparable. In order to distinguish this from a total order we use the notation *x* ⪯ *y*. If the elements *x* and *y* are incomparable, we shall write *x* ≁ *y*.

In the following, we describe two types of partial orders that are applicable to colour spaces with Cartesian and polar coordinates respectively.

#### 2.5.1. Product Order

By product order we mean the relation obtained from the component-wise comparison of colour values. Given **u** = (*c*1*u*, *c*2*u*, *c*3*u*) and **v** = (*c*1*v*, *c*2*v*, *c*3*v*) two triplets representing colours in a generic space we write:

$$\begin{cases} \mathbf{u} \preceq_{\times} \mathbf{v} & \text{if } c_{1u} \le c_{1v},\ c_{2u} \le c_{2v},\ c_{3u} \le c_{3v} \\ \mathbf{u} \not\sim_{\times} \mathbf{v} & \text{if neither } \mathbf{u} \preceq_{\times} \mathbf{v} \text{ nor } \mathbf{v} \preceq_{\times} \mathbf{u}. \end{cases} \tag{1}$$

Note that this is a sub-relation of the lexicographical order introduced in Section 2.3; it is however of higher practical interest as it treats all three channels symmetrically. In the RGB space, for instance, a given colour **u** weakly dominates the rectangular parallelepiped C(**u**) with three edges along the axes and a vertex in the colour itself (see Figure 1). Any colour **v** that neither lies in C(**u**) nor dominates all of C(**u**) is incomparable to **u**, i.e., **v** ≁ **u**.

**Figure 1.** Product order in the RGB space: A generic colour (*r*0, *g*0, *b*0) dominates all the colours in the blue volume and is dominated by all the colours in the red volume.

The product order can of course be applied to any colour space, giving relations of various degree of interpretability and effectiveness for pattern recognition (see Sections 3.1 and 4).
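A direct transcription of Equation (1) into code may help fix ideas. This is a minimal sketch for colour triplets stored as tuples; the function names and the string labels returned by `product_compare` are hypothetical.

```python
def product_leq(u, v):
    """True if u precedes or equals v component-wise, as in Equation (1)."""
    return all(a <= b for a, b in zip(u, v))

def product_compare(u, v):
    """Classify a pair of colour triplets under the product order."""
    below, above = product_leq(u, v), product_leq(v, u)
    if below and above:
        return "equal"
    if below:
        return "u precedes v"
    if above:
        return "u dominates v"
    return "incomparable"

dark_red, bright_red, green = (100, 0, 0), (200, 10, 5), (0, 150, 0)
```

For instance, `dark_red` precedes `bright_red` because every channel is smaller or equal, whereas `dark_red` and `green` are incomparable: neither triplet is component-wise below the other.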

#### 2.5.2. Loewner Order

The Loewner (partial) order is defined on symmetric matrices. Given two symmetric matrices **A**, **B** we write:

$$\begin{cases} \mathbf{A} \preceq_{\mathcal{L}} \mathbf{B} & \text{if } (\mathbf{B} - \mathbf{A}) \in \mathcal{S}_{+} \\ \mathbf{A} \not\sim_{\mathcal{L}} \mathbf{B} & \text{if neither } \mathbf{A} \preceq_{\mathcal{L}} \mathbf{B} \text{ nor } \mathbf{B} \preceq_{\mathcal{L}} \mathbf{A} \end{cases} \tag{2}$$

where 𝒮<sub>+</sub> indicates the set of positive semi-definite matrices. Applying this to a colour space requires mapping colour values to symmetric matrices. Following [5], we start from a modified colour space HCL̃ obtained from HSL ([25], Section 4.6) by setting *L̃* = 2*L* − 1 for the (modified) luminance and replacing saturation with chroma *C* = max{*R*, *G*, *B*} − min{*R*, *G*, *B*}. The resulting colour gamut fills a bicone with axis *L̃* and opening angle 90°. We isometrically map colours to the space *Sym*(2) of symmetric 2 × 2 matrices by setting [5]:

$$\mathbf{M}\left(h,c,\tilde{l}\right) = \frac{1}{\sqrt{2}} \begin{pmatrix} \tilde{l}-c & h\\ h & \tilde{l}+c \end{pmatrix} . \tag{3}$$

For two colours **u** = (*h<sub>u</sub>*, *c<sub>u</sub>*, *l̃<sub>u</sub>*) and **v** = (*h<sub>v</sub>*, *c<sub>v</sub>*, *l̃<sub>v</sub>*) in the HCL̃ space we therefore write:

$$\begin{cases} \mathbf{u} \preceq_L \mathbf{v} & \text{if } \mathbf{M}\left(h_u, c_u, \tilde{l}_u\right) \preceq_{\mathcal{L}} \mathbf{M}\left(h_v, c_v, \tilde{l}_v\right) \\ \mathbf{u} \not\sim_L \mathbf{v} & \text{if neither } \mathbf{u} \preceq_L \mathbf{v} \text{ nor } \mathbf{v} \preceq_L \mathbf{u} \end{cases} \tag{4}$$

where ⪯<sub>ℒ</sub> is defined in Equation (2). Geometrically (Figure 2), a given colour **v** weakly dominates all colours of lower luminance that fall in a cone with its vertex in **v** and its axis parallel to the *L̃* axis.
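The Loewner comparison of two colours reduces to a positive semi-definiteness test on a 2 × 2 symmetric matrix. The sketch below takes the entries of Equation (3) at face value, with colours given as (h, c, l̃) triplets, and uses the fact that a symmetric 2 × 2 matrix is positive semi-definite exactly when its trace and determinant are both non-negative. The function name and the tolerance `tol` are assumptions made for illustration.

```python
import math

def loewner_leq(u, v, tol=1e-12):
    """True if M(v) - M(u) is positive semi-definite, with M as in Equation (3).

    u and v are (h, c, l) triplets in the modified HCL space. Since M is
    linear in its arguments, M(v) - M(u) equals M applied to v - u.
    """
    dh, dc, dl = (b - a for a, b in zip(u, v))
    s = 1.0 / math.sqrt(2.0)
    m11, m12, m22 = s * (dl - dc), s * dh, s * (dl + dc)
    trace = m11 + m22
    det = m11 * m22 - m12 * m12
    return trace >= -tol and det >= -tol

origin = (0.0, 0.0, 0.0)
brighter = (0.0, 0.0, 0.5)      # same h and c, higher luminance
sideways = (0.3, 0.0, 0.1)      # small luminance gain, larger offset in h
```

The determinant works out to (dl² − dc² − dh²)/2, so the test passes only when the luminance difference is at least as large as the combined offset in the other two coordinates, which matches the cone described above.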

#### **3. Materials and Methods**

#### *3.1. Rank Features on Partial Orders*

In this section we show how to generalise existing rank-based descriptors by replacing total order in grey-scale with suitable partial orders in colour space. In the remainder we shall use the Texture Spectrum [26] as our reference model—though other descriptors such as Local Binary Patterns and Local Ternary Patterns are amenable to the same procedure with virtually no effort.

In Texture Spectrum, a local image pattern 𝒫 = {**p**<sub>0</sub>, **p**<sub>1</sub>, …, **p**<sub>*n*</sub>} is assigned a unique decimal code as follows:

$$f_{\rm TS}\left(\mathcal{P}\right) = \sum_{i=1}^{n} 3^{i}\, \tau\left[g\left(\mathbf{p}_{0}\right), g\left(\mathbf{p}_{i}\right)\right] \tag{5}$$

where **p**<sub>0</sub> represents the central pixel and **p**<sub>*i*</sub>, *i* ∈ {1, …, *n*} the peripheral pixels, which we assume to be arranged on a circle around the central pixel. We also assume that **p** represents a point in a 3D colour space, though again extension to multi-spectral data is straightforward. In Equation (5) the function *g*(**p**) stands for a generic conversion from colour into grey-scale, whereas *τ*(*w*, *z*) indicates the ternary thresholding function:

$$\tau \left( w, z \right) = \begin{cases} 0 & \text{if } w < z \\ 1 & \text{if } w = z \\ 2 & \text{if } w > z \end{cases} \tag{6}$$
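Equations (5) and (6) translate almost literally into code. The sketch below assumes the grey-scale conversion *g* has already been applied, so that the centre and the peripheral pixels are plain numbers, and it keeps the exponent as it appears in Equation (5); the function names are hypothetical.

```python
def tau(w, z):
    """Ternary thresholding function of Equation (6)."""
    if w < z:
        return 0
    if w == z:
        return 1
    return 2

def texture_spectrum_code(centre, periphery):
    """Code of Equation (5) for grey values g(p_0) and [g(p_1), ..., g(p_n)]."""
    return sum(3 ** i * tau(centre, p)
               for i, p in enumerate(periphery, start=1))

flat = texture_spectrum_code(5, [5] * 8)     # centre equal to all neighbours
peak = texture_spectrum_code(5, [1] * 8)     # centre above all neighbours
pit = texture_spectrum_code(5, [9] * 8)      # centre below all neighbours
```

Each peripheral pixel contributes one ternary digit, so the eight neighbours of a pattern yield one of 3<sup>8</sup> possible codes.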

An image is represented by the dense, orderless statistical distribution over the set of possible codes. For Texture Spectrum, the number of (directional) features generated by the method is clearly 3<sup>*n*</sup>. Invariance under rotations and/or reflections can be obtained by grouping together all those codes that represent patterns which can be transformed into one another by such transforms. The corresponding mathematical structures are *necklaces* and *bracelets*, respectively for invariance under rotations (i.e., cyclic group of order *n*; *C<sub>n</sub>*) and under rotations + reflections (i.e., dihedral group of order *n*; *D<sub>n</sub>*). For general formulas about the number of resulting *C<sub>n</sub>*- and *D<sub>n</sub>*-invariant features and for other mathematical details please refer to González et al. [27], Zelenyuk and Zelenyuk [28]. Specifically, for *n* = 8 (which is the case considered herein, see below) the number of features is respectively 834 and 498.
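The two feature counts can be reproduced with Burnside's lemma, which counts equivalence classes by averaging the number of patterns fixed by each group element. The sketch below uses the standard counting formulas for necklaces and bracelets rather than anything specific to the cited references; the function names are hypothetical.

```python
from math import gcd

def necklaces(n, k):
    """Number of k-ary necklaces of length n (classes under rotation)."""
    return sum(k ** gcd(i, n) for i in range(n)) // n

def bracelets(n, k):
    """Number of k-ary bracelets of length n (rotation and reflection)."""
    if n % 2 == 0:
        # n/2 reflection axes through two beads, n/2 through two gaps
        reflections = (n // 2) * (k ** (n // 2 + 1) + k ** (n // 2))
    else:
        # every reflection axis passes through exactly one bead
        reflections = n * k ** ((n + 1) // 2)
    return (n * necklaces(n, k) + reflections) // (2 * n)

directional = 3 ** 8
rotation_invariant = necklaces(8, 3)
rotation_reflection_invariant = bracelets(8, 3)
```

For ternary codes on eight neighbours this gives 6561 directional codes, 834 rotation-invariant classes and 498 classes invariant under rotations and reflections.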

A ternary rank feature for partially ordered data analogous to Texture Spectrum—the Partial Order Texture Spectrum (POTS)—can easily be defined in the following way:

$$f_{\text{POTS}}\left(\mathcal{P}\right) = \sum_{i=1}^{n} 3^{i}\, \varphi\left(\mathbf{p}_{0}, \mathbf{p}_{i}\right) \tag{7}$$

$$\varphi\left(\mathbf{u},\mathbf{v}\right) = \begin{cases} 0 & \text{if } \mathbf{u} \prec \mathbf{v} \\ 1 & \text{if } \mathbf{u} \succeq \mathbf{v} \\ 2 & \text{if } \mathbf{u} \not\sim \mathbf{v} \end{cases} \tag{8}$$

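where $\preceq$ indicates a generic partial order relation in the colour space (see Section 2.5), $\prec$ its strict version, and $\mathbf{u} \not\sim \mathbf{v}$ means that $\mathbf{u}$ and $\mathbf{v}$ are not comparable. Notably, the number of features generated by this formulation is the same as generated by the Texture Spectrum.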
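As an illustration of Equations (7) and (8), the sketch below computes the POTS code of a single local pattern. It assumes that the product order of Section 2.5.1 is the usual component-wise order (**u** ⪯ **v** when every channel of **u** is less than or equal to the corresponding channel of **v**); the function names are hypothetical and the pixel values are made up for the example.

```python
def product_order_phi(u, v):
    """Ternary comparison of Equation (8) under the component-wise (product) order.

    Returns 1 if u succeeds or equals v, 0 if u strictly precedes v,
    and 2 if the two colours are not comparable.
    """
    if all(a >= b for a, b in zip(u, v)):
        return 1
    if all(a <= b for a, b in zip(u, v)):
        return 0  # equality was already caught above, so u strictly precedes v
    return 2


def pots_code(centre, neighbours, phi=product_order_phi):
    """Code of one pattern following Equation (7): sum over i = 1..n of 3**i * phi(p0, pi)."""
    return sum(3 ** i * phi(centre, p) for i, p in enumerate(neighbours, start=1))


centre = (100, 100, 100)
neighbours = [(120, 130, 110), (90, 80, 100), (100, 100, 100), (120, 80, 100)]
print(pots_code(centre, neighbours))  # 0*3 + 1*9 + 1*27 + 2*81 = 198
```

Any other partial order (e.g., the Loewner order) could be plugged in through the `phi` argument, provided it returns the same three labels.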

In the experiments we considered the following partial order/colour space combinations: product order (Section 2.5.1) in the RGB, Ohta's and opponent spaces [25]; Loewner order (Section 2.5.2) in the HCL space. When reporting experimental results we use subscripts 'RGB', 'ohta' and 'opp' to indicate the colour spaces, and superscripts <sup>×</sup> and <sup>*L*</sup> respectively for the product and Loewner orders (see Equations (1) and (4)). No superscript was used to indicate the natural total order on greyscale values.

Conversion from RGB to grey-scale was also performed in three different ways: (1) through the standard PAL/NTSC formula ([25], Section 4.3.1); (2) by computing the average of the three channels; and (3) by determining, for each image, the principal axes of the colour distribution in the RGB space and projecting each (*r*, *g*, *b*) triplet onto the first axis. In the remainder we denote the corresponding variations of Texture Spectrum respectively as TS<sub>grey</sub>, TS<sub>*µ*</sub> and TS<sub>p1</sub>.
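The three conversions can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used in the experiments: the PAL/NTSC weights are the usual luma coefficients (0.299, 0.587, 0.114); the principal axis is estimated by power iteration on the 3 × 3 covariance matrix; the data are centred before projection; and the sign of the axis is fixed so that its components sum to a non-negative value. The last two choices are ours, since the text does not specify them.

```python
def grey_pal_ntsc(rgb):
    """Weighted sum of the channels with the standard luma coefficients."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def grey_mean(rgb):
    """Plain average of the three channels."""
    return sum(rgb) / 3.0


def first_principal_axis(pixels, iterations=100):
    """Mean colour and unit vector along the first principal axis of a list of RGB triplets."""
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    centred = [[p[c] - mean[c] for c in range(3)] for p in pixels]
    cov = [[sum(q[i] * q[j] for q in centred) / n for j in range(3)] for i in range(3)]
    # Power iteration; assumes the start vector is not orthogonal to the principal axis.
    axis = [3 ** -0.5] * 3
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(3)) for i in range(3)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm < 1e-12:  # flat image: no dominant direction
            break
        axis = [x / norm for x in nxt]
    if sum(axis) < 0:  # sign convention (an assumption of this sketch)
        axis = [-x for x in axis]
    return mean, axis


def grey_p1(pixels):
    """Projection of every pixel of one image onto the first principal axis."""
    mean, axis = first_principal_axis(pixels)
    return [sum((p[c] - mean[c]) * axis[c] for c in range(3)) for p in pixels]
```

Note that the first two conversions act on each pixel independently, whereas the third depends on the colour distribution of the whole image.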

Finally, we computed *C<sub>n</sub>*- and *D<sub>n</sub>*-invariant features over 3 × 3, non-interpolated, square neighbourhoods of radius 1px and 2px. The overall feature vector was obtained by concatenating the feature vectors obtained at each resolution; see also González et al. [27] for details. These settings respectively generate 834 × 2 = 1668 and 498 × 2 = 996 features.

#### *3.2. Experiments*

To test the effectiveness of the partial-order rank features described in Section 3.1 we ran a set of supervised image classification experiments. Datasets, classification strategy and accuracy estimation are described in the following subsections.

#### *3.3. Datasets*

We used ten datasets of colour texture images from different sources as described below. The main properties of each dataset are summarised in Table 1.


**Table 1.** Datasets used in the experiments: round-up table.

KEY TO SYMBOLS: = illumination, = rotation, = scale.

#### 3.3.1. Epistroma

Contains 1376 histopathological images from colorectal cancer representing either *epithelium* (825 images) or *stroma* (551 images). The image size ranges from 93 px to 2372 px in width and from 94 px to 2373 px in height. Further details about tissue preparation and digitisation procedure are available in Linder et al. [29].

#### 3.3.2. KTH-TIPS

Includes 10 classes of common materials (e.g., *aluminum foil*, *bread*, *corduroy*, etc.) with 81 image samples for each class [30,31]. Each material was acquired under nine scales, three rotation angles and three lighting directions.

#### 3.3.3. KTH-TIPS2b

Features 11 classes of materials (432 sample images per class) and is actually an extension of KTH-TIPS. The image acquisition settings were the same as in KTH-TIPS, but four rather than three illumination conditions were used in this case [32].

#### 3.3.4. Kylberg–Sintorn

Is composed of 25 classes of heterogeneous materials, such as food (e.g., *lentils*, *oatmeal* and *sugar*), fabric (e.g., *knitwear* and *towels*) and tiles [33,34]. For each class one sample image was acquired using invariable illumination conditions and under nine different rotation angles, of which only the images at 0° were included in our experiments. Each image was further subdivided into six non-overlapping sub-images of dimension 1728 × 1728 px.

#### 3.3.5. MondialMarmi

Comprises 25 classes of marble and granite products identified by their commercial denominations, e.g., *Azul Platino*, *Bianco Sardo*, *Rosa Porriño* and *Verde Bahía* [35]. Each class is represented by four tiles; ten images for each tile were acquired under steady illumination conditions and at rotation angles from 0° to 90° in steps of 10°. In the experiments we only used the images at 0°; moreover, we subdivided each image into four non-overlapping sub-images, therefore obtaining 16 image samples for each class.

#### 3.3.6. OUTEX-13 and OUTEX-14

Are based on the same sets of images that respectively make up the OUTEX\_TC\_00013 and OUTEX\_TC\_00014 test suites (see Ojala et al. [36] for details). Specifically, OUTEX-13 features 68 classes of materials with 20 images per class acquired under invariable illumination conditions; OUTEX-14 contains the same classes, but in this case the image samples were acquired under three different illumination conditions, so there are 60 samples per class. Please note, however, that in order to maintain the same evaluation protocol for all the datasets considered here (see Section 3.4), the subdivisions into train and test sets used in our experiments were not the same as in the OUTEX\_TC\_00013 and OUTEX\_TC\_00014 test suites.

#### 3.3.7. Pap Smear

Consists of 917 PAP-stained images of variable dimension representing cells from the cervix [37]. The images represent either *abnormal* cases (675 samples) or *normal* cases (242 samples). The dataset also comes with a further subdivision into seven sub-classes which was not considered in our experiments. The image size ranges from 84 × 88 px to 392 × 262 px. In our experiments we considered a balanced sub-set containing 204 samples for each of the two classes.

#### 3.3.8. Plant Leaves

Includes a total of 1200 samples of plant leaves from 20 different classes with 60 samples per class [38]. The images were acquired using a planar scanner and have a dimension of 128 × 128 px.

#### 3.3.9. RawFooT

Comprises 68 classes of raw food and grains such as *corn*, *chicken breast*, *pomegranate*, *salmon* and *tuna* [39,40]. The materials were acquired under 46 different illumination conditions resulting in as many image samples for each class. We further subdivided the images into four non-overlapping sub-images, thus obtaining 184 samples for each class. The dimension of the resulting image tiles was 400 × 400 px.

#### *3.4. Classification and Accuracy Estimation*

For each dataset described in Section 3.3 we performed supervised classification using a nearest neighbour classifier (1-NN) with the *L*<sub>1</sub> ('Manhattan') distance. In detail, after extracting a feature vector from all images according to one of the descriptors tested, we computed the distance between such vectors as the sum of the absolute differences between components. We then assigned each test vector to the class of the closest training vector. The absence of tuning parameters, the ease of implementation and other desirable asymptotic properties make the 1-NN particularly appealing for comparison purposes. Its use in related works is indeed customary: see for instance Cusano et al. [39], Kandaswamy et al. [41] and Liu et al. [42].
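The classification rule just described amounts to a few lines of code. The sketch below is only an illustration of the 1-NN rule with the *L*<sub>1</sub> distance; the function names and the toy feature vectors are ours.

```python
def l1_distance(u, v):
    """Sum of the absolute differences between components ('Manhattan' distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def predict_1nn(train_vectors, train_labels, query):
    """Label of the training vector closest to the query in the L1 sense."""
    best = min(range(len(train_vectors)),
               key=lambda i: l1_distance(train_vectors[i], query))
    return train_labels[best]


# Toy example with two three-bin histograms as training set.
train_vectors = [[0.5, 0.5, 0.0], [0.0, 0.2, 0.8]]
train_labels = ["class A", "class B"]
print(predict_1nn(train_vectors, train_labels, [0.4, 0.5, 0.1]))  # class A
```

In case of ties, `min` returns the first training vector found at the minimum distance.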

Accuracy estimation was based on split-half validation with stratified sampling: for each dataset we used half of the samples of each class to train the classifier (*train set*) and the other half (*test set*) to compute the accuracy. This was defined as the ratio between the number of samples of the test set correctly classified (*N<sub>c</sub>*) and the total number of samples of the test set (*N*):

$$a = \frac{N_c}{N}.\tag{9}$$

For a stable estimation we averaged the above value over a hundred different subdivisions into train and test set:

$$
\hat{a} = \frac{\sum_{i=1}^{100} a_i}{100}, \tag{10}
$$

where *a<sub>i</sub>* indicates the accuracy achieved in the *i*-th subdivision into train and test set. In Table 2 we report the 95% confidence intervals for *â* (computed under the simplifying assumption of normal distribution).
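The validation protocol can be sketched as below. This is a minimal illustration, not the code used in the experiments: `evaluate` is a hypothetical user-supplied function that trains on the given indices and returns the accuracy of Equation (9) on the test indices, and the confidence interval is computed as the mean ± 1.96 standard errors, which is our reading of the normality assumption mentioned above.

```python
import random
import statistics


def stratified_half_split(labels, rng):
    """Indices of a train/test split that takes half of the samples of each class for training."""
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train += shuffled[:half]
        test += shuffled[half:]
    return train, test


def mean_and_ci95(accuracies):
    """Average accuracy (Equation (10)) and normal-approximation 95% confidence interval."""
    mean = statistics.mean(accuracies)
    sem = statistics.stdev(accuracies) / len(accuracies) ** 0.5
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)


def estimate_accuracy(labels, evaluate, n_splits=100, seed=0):
    """Average the accuracy returned by evaluate(train_idx, test_idx) over random splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_splits):
        train_idx, test_idx = stratified_half_split(labels, rng)
        accuracies.append(evaluate(train_idx, test_idx))
    return mean_and_ci95(accuracies)
```

With an odd number of samples in a class, this sketch assigns the extra sample to the test set.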

**Table 2.** Overall classification accuracy: confidence intervals for the cross-validated accuracy *â*. Best results highlighted for grey-level (orange) and colour space features (blue). Boldface figures indicate statistically significant differences.


#### **4. Results and Discussion**

Table 2 reports the confidence intervals for the means of the overall classification accuracy (see Section 3.4). For each dataset we highlighted in orange the best result obtained by total-order grey-scale rank features; in blue the best result obtained by partial order rank features (POTS) in colour space. When there was a statistically significant difference between the two, the best figure was indicated in boldface. As can be seen, partial order rank features in colour space performed significantly better in five datasets out of 10, whereas the reverse occurred in one dataset only (dataset six). In the remaining four datasets there was no significant difference between the two methods.

As for grey-scale rank features, the results show that in most cases (i.e., 8 datasets out of 10) the best performance was obtained using standard PAL/NTSC grey-scale conversion. By contrast, partial-order rank features showed a higher dependence on the colour space used.

The computational cost of all the descriptors considered is roughly equivalent, as the number of features is the same and the complexity of computing a partial or total order comparison in colour space is comparable to the cost of a colour space transformation. Indeed, as we have just described, even the traditional TS requires a grey-scale conversion, the choice of which can be seen as an integral part of the descriptor.

In Table 3 we compare our results to published results obtained using rank-based descriptors in conjunction with other ordering methods in colour spaces. As can be seen, in most cases our partial-order based approach improves significantly over previous results. We should here emphasise that the computational requirements of our partial-order descriptors are not higher than those of the other ordering methods cited.

**Table 3.** Comparison with the results obtained by other ordering methods as reported in the references indicated. Key to symbols: 'cvn' = colour vector norm, 'lex' = lexicographic ordering, 'rcl' = preorder based on white as reference colour. Please refer to the cited works for further details.


#### **5. Conclusions and Future Work**

The lack of a natural order among colours represents an intrinsic impediment to the definition of rank features in colour space. In this paper we have introduced a novel and general approach based on partial orders. Partial orders overcome the problems inherent to ordering multivariate data at the expense of admitting that not all pairs of colours can be compared to each other. We showed that this scheme fits in well with existing grey-scale local image descriptors, which are amenable to extension to the colour domain with little effort. Taking the Texture Spectrum as a model, we showed that its partial-order version in colour space (POTS) can outperform the grey-scale classic descriptor while maintaining the same number of features and with comparable computational complexity. Previous studies have also demonstrated that the use of colour can improve texture discrimination, but at the expense of employing a higher number of features [44–46]. Notably, our approach improves on published results that use descriptors based specifically on (total) colour space ordering (see Table 3).

To the best of our knowledge this is the first time that partial orders have been used to define rank features for pattern recognition. The method is conceptually simple, fairly general and shows potential for application in a wide range of computer vision tasks. Future studies will be focussed on extending the approach to the broader class of descriptors known as Histograms of Equivalent Patterns [9]. The effect of the colour space on the performance of rank features based on partial orders is also an important topic for further investigation. Finally, the insertion of partial order based algorithms in more involved image processing pipelines (e.g., convolutional neural networks) also represents an interesting opportunity for future research; integration at the level of matching [47] has so far been successful.

**Author Contributions:** Conceptualization, F.S. and F.B.; Formal analysis, F.S., F.B. and A.F.; Methodology, F.S., F.B., A.F. and E.G.; Software, F.S. and F.B.; Validation, F.S., F.B., A.F. and E.G.; Visualization, F.B. and A.F.; Writing—original draft, F.S. and F.B.; Writing—review & editing, A.F. and E.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially supported by the Spanish Government under projects AGL2014-56017-R and TIN2014-56919-C3-2-R, and by the Department of Engineering at the Università degli Studi di Perugia (UniPG Eng), Italy, under project *Machine learning algorithms for the control of autonomous mobile systems and the automatic classification of industrial products and biomedical images* (Fundamental research grants 2017). F.S. performed part of this work as a Visiting Researcher at UniPG Eng. He gratefully acknowledges the support of UniPG under international mobility grant 'D.R. n.2270/2015'.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Detection of Tampering by Image Resizing Using Local Tchebichef Moments**

**Dengyong Zhang 1,† , Shanshan Wang 1,†, Jin Wang 1,\* , Arun Kumar Sangaiah <sup>2</sup> , Feng Li <sup>1</sup> and Victor S. Sheng <sup>3</sup>**


Received: 30 June 2019; Accepted: 23 July 2019; Published: 26 July 2019

**Abstract:** There are many image resizing techniques, including scaling, scale-and-stretch, seam carving, and so on. Each has its own advantages and suits different application scenarios, so a universal method for detecting tampering by image resizing is more practical than one tied to a single technique. In preliminary experiments, we found that whichever image resizing technique is adopted, it destroys local texture and spatial correlations among adjacent pixels to some extent. Motivated by the excellent performance of local Tchebichef moments (LTM) in texture classification, we present in this paper a method for detecting tampering by image resizing using LTM. The tampered images are obtained by removing pixels from original images using image resizing (scaling, scale-and-stretch and seam carving). Firstly, the residual is obtained by image pre-processing. Then, the histogram features of LTM are extracted from the residual. Finally, an error-correcting output code strategy is adopted with ensemble learning, which turns a multi-class classification problem into binary classification sub-problems. Experimental results show that the proposed approach obtains acceptable detection accuracy for the three content-aware image re-targeting techniques.

**Keywords:** image resizing; local Tchebichef moments (LTM); scaling; scale-and-stretch; seam carving

### **1. Introduction**

As image editing tools and various mobile devices are easily acquired and conveniently used, maximizing the viewing experience of end users on small devices becomes very important. Compared to traditional image re-targeting methods, such as linear scaling and cropping, many content-aware image resizing methods can preserve salient areas, avoiding serious distortions or loss of significant information [1–3]. Meanwhile, many content-aware resizing algorithms have been incorporated into image editing tools, such as Photoshop and GIMP. An ordinary user can therefore very easily create tampered images for malicious purposes. Moreover, such tampered images are often impossible to distinguish from authentic images with the naked eye. Therefore, how to detect tampered images is a hot topic in the field of image content security.

In recent years, a few approaches have been proposed for the detection of content-aware image re-targeting, most of which target seam carved images. Lu et al. adopted a forensic hash to determine whether an image has been subjected to a seam carving operation [4]. However, this is an active forensics approach, and a falsifier might remove the forensic hash. For passive forensic detection, Sarkar et al. used 324D Markov features to detect image seam carving [5]. Later, Fillion et al. exploited a series of intrinsic features to expose the traces of seam carved images [6]. In Wei et al. [7], an approach based on patch analysis was adopted to distinguish whether an image is original or not. Based on the noise and energy distribution in seam carved images, Ryu et al. [8] exploited features based on noise and energy bias to detect seam carved images. Local binary pattern (LBP) was adopted to detect seam carved images [9] in our recent work. Inspired by the ability of image entropy to capture the intrinsic information of an image, we exploited multi-scale spectral and spatial entropies to detect seam carved images with low resizing ratios [10]. Weber Local Descriptor (WLD) and LBP were adopted to distinguish whether an image is original or not [11]. In [12], a large feature mining approach was proposed to detect image seam carving under recompression in joint photographic experts group (JPEG) images.

However, most existing detectors of image resizing are designed for one specific content-aware resizing method. Much less has been done to distinguish between different content-aware resizing approaches in the process of image re-targeting. In practice, the best re-targeting method depends on the image itself. For example, scaling an image in the horizontal or vertical direction can be performed in real time using interpolation; it preserves the global visual effect and yields re-targeted images of medium perceptual quality. However, scaling introduces some shape deformation into the re-targeted image. Seam carving [1] supports various visual saliency measures for defining the energy of an image. Nevertheless, seam carving can excessively carve less important parts of an image and result in unwanted visual distortions. Scale-and-stretch [3] can preserve the aspect ratios of local objects. However, if there are many quads in the image, the approach will fail to preserve the aspect ratio of the whole image [2]. Different image resizing methods are therefore chosen depending on the image content, so as to change the image size while preserving the salient region, and it is necessary to propose a universal detector of image resizing.

The rest of the paper is organized as follows: Section 2 summarizes several common methods of image re-targeting and analyzes their artifacts. Section 3 briefly introduces the proposed detection approach. Our experimental results are described and analyzed in detail in Section 4, and conclusions are drawn in Section 5.

#### **2. Image Resizing Methods and Their Possible Artifacts**

#### *2.1. Several Common Methods of Image Resizing*

Among the methods of content-aware image resizing, scaling, seam carving [1], and scale-and-stretch [3] are three common approaches to re-target an image. Seam carving is defined by forward energy. The intensity gradient magnitude in the *L*<sub>1</sub> metric is used as the importance map. Contiguous chains of pixels (seams) that pass through the regions of least importance in the image are deleted or duplicated to resize the image, and dynamic programming is adopted to compute the seams. Scale-and-stretch is defined by warping. Both image dimensions are processed by warping at once, and an objective function is optimized so that important regions are scaled uniformly in order to preserve their shapes. A combination of saliency and *L*<sub>2</sub> gradient magnitude (defined by [13]) is used as the importance map. Scaling implements image resizing simply by bi-cubic interpolation and non-uniform scaling.

According to the above description of the three resizing methods, scale-and-stretch can keep the significant regions of an image consistent with the original after re-targeting. Seam carving, however, re-targets an image by deleting or inserting pixels along minimal-energy seams; therefore, it may distort salient objects.
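To make the seam computation concrete, the following sketch removes one vertical seam from a grey-level image stored as a list of rows. It is a simplified illustration and not the implementation of [1]: the importance map is the plain *L*<sub>1</sub> gradient magnitude (so-called backward energy) rather than the forward energy mentioned above, borders are replicated, and all function names are ours.

```python
def energy_map(img):
    """L1 gradient magnitude |dI/dx| + |dI/dy| with replicated borders."""
    h, w = len(img), len(img[0])
    energy = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            dy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            energy[y][x] = abs(dx) + abs(dy)
    return energy


def find_vertical_seam(energy):
    """Column index, for each row, of the minimum-energy 8-connected vertical seam."""
    h, w = len(energy), len(energy[0])
    cost = [row[:] for row in energy]
    # Dynamic programming: cumulative cost of the cheapest seam ending at (y, x).
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(x - 1, 0), min(x + 1, w - 1)
            cost[y][x] += min(cost[y - 1][lo:hi + 1])
    # Backtrack from the cheapest cell of the last row.
    x = min(range(w), key=lambda c: cost[h - 1][c])
    seam = [x]
    for y in range(h - 1, 0, -1):
        lo, hi = max(x - 1, 0), min(x + 1, w - 1)
        x = min(range(lo, hi + 1), key=lambda c: cost[y - 1][c])
        seam.append(x)
    seam.reverse()
    return seam


def remove_vertical_seam(img, seam):
    """Delete one pixel per row, reducing the image width by one."""
    return [row[:x] + row[x + 1:] for row, x in zip(img, seam)]
```

Calling `remove_vertical_seam(img, find_vertical_seam(energy_map(img)))` repeatedly narrows the image one column at a time; each removal shifts the pixels to the right of the seam, which is what disturbs the correlation between adjacent pixels.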

#### *2.2. Analysis of Image Resizing Artifacts*

Images processed by content-aware resizing methods exhibit three kinds of artifacts [13], namely geometric deformation, information loss and local texture distortion. Figure 1 shows these artifacts. Figure 1b shows line or edge distortion after re-targeting; however, salient areas of the image, such as people and buildings, do not change significantly. The shape distortion of an image is shown in Figure 1c. This indicates that removed pixels may lie in salient areas of the image when minimum-energy pixels are deleted during seam carving, so important objects are easily deformed in the process of image resizing. Figure 1d shows the artifact of information loss: the scaling method uses bi-cubic interpolation and re-targets the entire image. To better expose the image distortion caused by resizing, we adopt an LTM histogram to identify the distortion caused by different resizing methods in this paper. Figure 2 shows the residual LTM diagram for the three resizing methods. The distortions of non-subject areas are not easily perceived in Figure 1b, but they can be clearly seen in the residual LTM diagram of Figure 2b. Likewise, the distortions introduced by re-targeting are clearly visible in Figure 2d.

**Figure 1.** Resized images obtained by different re-targeting methods: (**a**) original image; (**b**) scale-and-stretch method; (**c**) seam carving method; (**d**) scaling method.

**Figure 2.** Residual LTM (local Tchebichef moments) diagram obtained by different re-targeting methods: (**a**–**d**) correspond to the residual LTM diagram of Figure 1a–d, respectively.

## **3. Proposed Method**

A passive method for detecting image resizing forgery is presented in this paper. Figure 3 shows the block diagram of the algorithm. As in most existing forensics methods, the proposed method consists of two parts, i.e., a training part and a testing part. In the training process, tampered images and their corresponding original images are adopted as data sets. First, all the training images are preprocessed. Second, the LTM histogram features are extracted from the preprocessed images. Finally, a trained model is obtained by applying ensemble learning to the extracted features. In the testing process, the LTM histogram features are extracted by the same steps as in the training part. The extracted features are then fed to the trained ensemble classifier to identify which resizing method was used to re-target the tested image.

**Figure 3.** A block diagram of our proposed approach.

#### *3.1. Preprocessing*

An image obtained by content-aware resizing methods usually has a good visual effect, and users can hardly distinguish it from an authentic photograph with the naked eye. However, the correlation between adjacent pixels inevitably changes after an image is resized. Therefore, it is necessary to preprocess re-targeted images. Image residuals can efficiently capture the change of adjacent pixels in the process of image re-targeting. In this paper, a one-dimensional low-pass filter is adopted to calculate residuals along the horizontal and the vertical directions. The formula is shown as Equation (1):

$$R(x, y) = I(x, y) - I(x, y) * L(u), \tag{1}$$

where *I*(*x*, *y*) is an image, *L*(*u*) is a low-pass filter and ∗ denotes convolution. Figure 4 shows the residuals of images resized by content-aware methods. The residuals clearly reveal the tampering traces left by the different content-aware image resizing methods.

**Figure 4.** Tampered images and corresponding residual images obtained by different re-targeting methods: (**d**–**f**) correspond to the residual LTM diagrams of (**a**–**c**), respectively.
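Equation (1) can be sketched as follows for a grey-level image stored as a list of rows. The coefficients of the low-pass filter are not given in the text, so the three-tap averaging kernel used here is purely an assumption, as are the replicated borders and the choice of returning the horizontal and vertical residuals separately; the function names are ours.

```python
def low_pass_1d(signal, kernel):
    """Filter a 1-D sequence with a symmetric odd-length kernel, replicating the border samples."""
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - half, 0), n - 1)  # clamp the index at the borders
            acc += weight * signal[j]
        out.append(acc)
    return out


def residuals(img, kernel=(1 / 3, 1 / 3, 1 / 3)):
    """Horizontal and vertical residuals R = I - I * L of a grey-level image."""
    h, w = len(img), len(img[0])
    smooth_rows = [low_pass_1d(row, kernel) for row in img]
    smooth_cols = [low_pass_1d([img[y][x] for y in range(h)], kernel) for x in range(w)]
    res_h = [[img[y][x] - smooth_rows[y][x] for x in range(w)] for y in range(h)]
    res_v = [[img[y][x] - smooth_cols[x][y] for x in range(w)] for y in range(h)]
    return res_h, res_v
```

Smooth regions give residuals close to zero, whereas abrupt changes between neighbouring pixels, such as those left by a removed seam, produce large values.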

#### *3.2. Features of LTM*

After image preprocessing, orthogonal Tchebichef moments are computed on 5 × 5 pixel neighborhoods to construct feature vectors. The texture information is then encoded with a Lehmer code that represents the relative strengths of the moments. The extracted feature vectors are called LTM. The encoding scheme provides a byte value for each pixel, which yields an LTM diagram. The histogram features of LTM are then adopted to identify whether an image has been subjected to image resizing. Figure 5 shows the histogram features of LTM after preprocessing.

**Figure 5.** The histogram features of LTM: (**a**–**c**) correspond to the LTM histograms of Figure 4d–f, respectively.
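The exact set of moments and the encoding used for LTM are defined in the original LTM work and are not reproduced here. The sketch below only illustrates the general idea of a Lehmer code: the relative strengths of a few values are turned into a permutation, which is then mapped to a single integer. The use of five values is an assumption made for illustration; it gives 5! = 120 possible orderings, which fit in one byte. The function names and the sample values are ours.

```python
from math import factorial


def rank_order(values):
    """Rank of each value (0 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    ranks = [0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks


def lehmer_index(perm):
    """Lexicographic index of a permutation of 0..n-1, computed from its Lehmer code."""
    n = len(perm)
    index = 0
    for i in range(n):
        smaller_to_the_right = sum(1 for j in range(i + 1, n) if perm[j] < perm[i])
        index += smaller_to_the_right * factorial(n - 1 - i)
    return index


moments = [0.7, 0.1, 0.4, 0.9, 0.2]   # made-up moment magnitudes at one pixel
print(rank_order(moments))            # [3, 0, 2, 4, 1]
print(lehmer_index(rank_order(moments)))
```

Because the code depends only on the ordering of the values, it is unchanged by any monotonically increasing transformation of the moment magnitudes.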

#### *3.3. Ensemble Learning for Blind Forensics*

In this paper, an error-correcting output codes (ECOC) strategy [14] based on ensemble learning is adopted, which transforms a multi-class classification problem into binary classification sub-problems. ECOC is an excellent multi-class classification tool, and ensemble learning performs well in terms of computational complexity and detection accuracy. Tampering by three different resizing methods, namely scale-and-stretch, seam carving and scaling, is to be identified. For this three-class classification problem, a pair coupling strategy [15] is adopted. Specifically, a discrete matrix (the coding matrix) is defined first. Then, the problem is decomposed into *n* = 3 binary classification sub-problems, namely dichotomies, according to the sequence of 0 and 1 in the coding matrix. After that, ensemble learning is adopted to train these dichotomies and to test the extracted histogram of LTM, which yields a binary vector. Finally, the class is identified by the minimum Hamming distance between the codewords and this vector.
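The decoding step can be illustrated with a short sketch. The coding matrix below is hypothetical (one row per class, one column per dichotomy) and is not the matrix used in the experiments; `outputs` stands for the 0/1 decisions returned by the three trained binary classifiers.

```python
def hamming(a, b):
    """Number of positions in which two equal-length binary words differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(coding_matrix, class_names, outputs):
    """Class whose codeword is closest, in Hamming distance, to the classifier outputs."""
    distances = [hamming(codeword, outputs) for codeword in coding_matrix]
    return class_names[distances.index(min(distances))]


# Hypothetical coding matrix for the three resizing methods.
CLASS_NAMES = ["scale-and-stretch", "seam carving", "scaling"]
CODING_MATRIX = [
    (1, 0, 0),
    (0, 1, 0),
    (0, 0, 1),
]

print(ecoc_decode(CODING_MATRIX, CLASS_NAMES, (0, 1, 0)))  # seam carving
```

With so short a code, ties between codewords are possible, and this sketch then returns the first class at the minimum distance; longer codewords with a larger minimum distance between rows are what give ECOC its error-correcting ability.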

#### **4. Results**

#### *4.1. Experimental Environment*

To verify the performance of our proposed algorithm, we conducted a number of experiments on a personal computer. The passive forensics approach is implemented in MATLAB R2012b. The ensemble learning implementation can be downloaded from [16]. In this paper, the Uncompressed Colour Image Database (UCID) [17] is adopted as the source of original images. The database contains 1338 images of people, buildings, animals and landscapes. Since no image database for image resizing detection is publicly available, we construct an image library from UCID for this purpose. Three resizing methods are used to produce tampered images at different resizing ratios. The resizing ratios vary from 10% to 50% with a step size of 10%. That is, for every resizing method, the resizing ratios of tampered images are 10%, 20%, 30%, 40%, and 50%. Therefore, we have 1338 original images and 3 × 5 × 1338 tampered images. To verify our proposed approach, we evaluate its performance in the following cases: (1) tamper detection for a single resizing method; (2) tamper detection for multiple resizing methods; and (3) tamper detection without preprocessing. In all experiments, the ECOC based on an ensemble learning strategy is adopted to test the effect of our proposed method. The image dataset is divided into two groups, 50% for training and 50% for testing. The training and testing are repeated ten times, and the average detection accuracy is reported in this paper.

#### *4.2. Experimental Discussions*

#### 4.2.1. Tamper Forensics on a Single Resizing Method

In this experiment, we test the detection performance of our proposed method for a single resizing method. The tampered images with scaling ratios from 10% to 50% are adopted to test the performance of our proposed approach. Table 1 shows the detection results. From Table 1, we can see that the detection accuracy improves as the scaling ratio increases. When the scaling ratio is less than 20%, our proposed approach has a higher detection accuracy for the scale-and-stretch resizing method. Since the optimal local scaling ratio of each local block is calculated iteratively and the warped image is updated simultaneously to match the scaling ratios as much as possible, the entire image is resized in the process of scale-and-stretch re-targeting. The seam carving method, by contrast, resizes an image by deleting only the seams with the lowest energy. Therefore, the tampered images obtained by the seam carving method are difficult to distinguish from the authentic images when the scaling ratio is low. As the scaling ratio increases, seam carving causes a global structure distortion. This is also reflected in Table 1: as the resizing ratio increases, our proposed approach achieves a higher accuracy for images obtained using the seam carving method than for those obtained using the other resizing methods.


**Table 1.** Comparisons in terms of accuracy for tampered images with single re-targeting methods.

#### 4.2.2. Identifying Images Obtained by Different Content-Aware Resizing Methods

In this experiment, the tampered images with scaling ratios from 10% to 50% are obtained by different content-aware resizing methods. They are adopted to test the performance of our proposed algorithm. The average detection accuracies of different content-aware re-targeting methods, where the average detection accuracy is the average value of diagonal elements in the confusion matrix, are summarized in Table 2. Note that "mixed" represents the mixed test set of tampered images with the scaling ratios from 10% to 50%. There are three content-aware resizing methods in this paper. Therefore, this is a four-class classification problem (the original images as a special class), according to the ECOC strategy.
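The average detection accuracy defined above can be computed from a confusion matrix as in the following sketch. The matrix holds raw counts (rows are true classes, columns are predicted classes) and is normalised row by row before the diagonal is averaged; the counts in the example are made up and do not correspond to any result reported in this paper.

```python
def average_detection_accuracy(confusion):
    """Mean of the diagonal of the row-normalised confusion matrix."""
    per_class = []
    for i, row in enumerate(confusion):
        per_class.append(row[i] / sum(row))  # fraction of class i classified as class i
    return sum(per_class) / len(per_class)


# Made-up four-class example: original, scale-and-stretch, seam carving, scaling.
confusion = [
    [90, 10, 0, 0],
    [5, 80, 10, 5],
    [0, 10, 85, 5],
    [0, 5, 5, 90],
]
print(average_detection_accuracy(confusion))  # (0.90 + 0.80 + 0.85 + 0.90) / 4 = 0.8625
```

When all classes contain the same number of test images, this value coincides with the overall fraction of correctly classified images.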

**Table 2.** The average detection accuracies of our proposed approach with preprocessing.


Table 2 shows that the average accuracy improves as the scaling ratio increases. However, the detection accuracy clearly decreases for compressed images with a quality factor (QF) of 75. A careful analysis of our experimental results shows that the main reason for this decrease is that the traces of tampering are weakened when images are compressed. We also ran the experiment on the "mixed" tampered uncompressed and compressed images and obtained the corresponding confusion matrices. Table 3 shows our experimental results, where CMOMTUI represents the confusion matrix of "mixed" tampered uncompressed images, CMOMTCI represents the confusion matrix of "mixed" tampered compressed images, "\*" represents a classification accuracy of less than 1%, OR represents original images, SNS represents the scale-and-stretch method, SC represents the seam carving method, and SL represents the scaling method. From Table 3, we can see that our proposed method achieves a high accuracy for the three content-aware resizing methods considered in this paper. However, it cannot obtain a good detection accuracy on the tampered images with JPEG compression. In addition, its false positive rate is relatively high for the seam carving method (SC) and the scaling method (SL).


**Table 3.** CMOMTUI and CMOMTCI with preprocessing.

#### 4.2.3. The Detection Accuracy without Preprocessing

Tables 4 and 5 report the results for the uncompressed and compressed tampered images without preprocessing, where CMOMTUI denotes the confusion matrix of "mixed" tampered uncompressed images, CMOMTCI denotes the confusion matrix of "mixed" tampered compressed images, "\*" denotes a classification rate below 1%, OR denotes original images, SNS denotes the scale-and-stretch method, SC denotes the seam carving method, and SL denotes the scaling method. Tables 4 and 5 show that our proposed approach has a slightly higher detection accuracy on the uncompressed images when the LTM features are extracted from the images without preprocessing. However, when the images are compressed with QF = 75, the accuracy of our proposed approach is significantly lower than that obtained with preprocessing. The main reason is that the residual used in preprocessing may weaken the tampering traces of uncompressed images, whereas for compressed images the residual highlights the changes.


**Table 4.** The average detection accuracies of our proposed approach without preprocessing.


**Table 5.** CMOMTUI and CMOMTCI without preprocessing.

#### **5. Conclusions**

Content-aware image re-targeting methods are widely adopted to resize images for display on all kinds of terminals. However, they can also be used to make fake images that show no perceptually annoying distortions. By analysing the principles of the three image resizing methods, we found that the correlation between adjacent pixels can be destroyed in the process of image resizing. Tchebichef moments have been extensively applied in image and video processing, for example in information security [18], pattern recognition [19] and image quality assessment [20]. Inspired by this, we found experimentally that LTM can effectively reflect the correlation changes between adjacent pixels. In this paper, we proposed a passive forensics algorithm based on LTM to identify tampered images obtained by content-aware image resizing methods. Our experimental results showed that the proposed method achieves a high detection accuracy and performs well for image resizing with high scaling ratios. In the future, we will try to evaluate the detection accuracy on tampered images obtained by image resizing with low scaling ratios. In addition, our proposed method could not obtain a satisfactory detection accuracy on re-targeted images with JPEG compression, so we will further analyze the tampering traces of resized images with JPEG compression. Since there are still a few image resizing methods [21,22] besides the three methods considered in this paper, we will also attempt to distinguish these image resizing techniques for re-targeted images by applying other multi-class classifiers [23–28] and designing more general features from image/video processing methods [29–38]. In view of the importance of social media digital images in practical applications, research on their authenticity, integrity and traceability has been one of the hot and challenging research topics in the field of information security. We will adopt network optimization methods [39–48] to improve the real-time performance and efficiency of the feature extraction phases.

**Author Contributions:** Conceptualization: D.Z. and S.W.; investigation: J.W.; methodology: D.Z. and S.W.; software: D.Z. and S.W.; supervision: F.L.; validation: A.K.S. and V.S.S.; writing—original draft: D.Z. and S.W.; writing—review and editing: J.W. and A.K.S.

**Funding:** This research was funded by the National Natural Science Foundation of China (61772454, 61811530332, 61811540410, 61772087, 61232016), the Scientific Research Fund of Hunan Provincial Education Department of China (14C0029) and the "Double First-class" International Cooperation and Development Scientific Research Project of Changsha University of Science and Technology (No. 2018IC25).

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **A Bounded Scheduling Method for Adaptive Gradient Methods**

#### **Mingxing Tang <sup>1</sup> , Zhen Huang 1,\*, Yuan Yuan <sup>2</sup> , Changjian Wang <sup>2</sup> and Yuxing Peng <sup>1</sup>**


Received: 22 July 2019; Accepted: 28 August 2019; Published: 1 September 2019

**Abstract:** Many adaptive gradient methods, such as Adagrad, Adadelta, RMSprop and Adam, have been successfully applied to train deep neural networks. These methods perform local optimization with an element-wise scaling of the learning rate based on past gradients. Although these methods can achieve an advantageous training loss, some researchers have pointed out that their generalization capability tends to be poor compared to stochastic gradient descent (SGD) in many applications. These methods obtain a rapid initial training process but fail to converge to an optimal solution due to unstable and extreme learning rates. In this paper, we investigate the adaptive gradient methods and gain insight into various factors that may lead to the poor performance of Adam. To overcome them, we propose a bounded scheduling algorithm for Adam, which not only improves the generalization capability but also ensures convergence. To validate our claims, we carry out a series of experiments on image classification and language modeling tasks with several standard models such as ResNet, DenseNet, SENet and LSTM on typical data sets such as CIFAR-10, CIFAR-100 and Penn Treebank. Experimental results show that our method can eliminate the generalization gap between Adam and SGD while maintaining a relatively high convergence rate during training.

**Keywords:** deep neural networks; adaptive gradient methods; stochastic gradient descent; bounded scheduling method; image classification; language modeling

## **1. Introduction**

Deep neural networks (DNNs) [1] have achieved great successes in many applications, such as image recognition [2], object detection [3], speech recognition [4,5], face recognition [6] and machine translation [7]. How to train DNNs quickly and accurately has attracted the attention of many researchers. Training neural networks is equivalent to solving the following non-convex optimization problems:

$$\min\_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum\_{i=1}^n f\_i(w), \tag{1}$$

where *w* is the parameter vector to train, *n* is the number of instances, and *f<sub>i</sub>*(·) : ℝ<sup>*d*</sup> → ℝ is the loss function defined on the instance indexed by *i*, with *d* the dimension of *w*. Training algorithms search for parameters that minimize the loss function.

Stochastic gradient descent (SGD) [8] has become the dominant training algorithm for DNNs. Simple as it is, SGD performs well in many applications. SGD obtains a smaller loss by moving the parameters of the model in the negative direction of gradient evaluated on a minibatch. The iteration of SGD can be described as follows:

$$w\_k = w\_{k-1} - \eta \nabla f\_{i\_k}(w\_{k-1}), \tag{2}$$

where *η* is the learning rate, *i<sub>k</sub>* is the instance index at the *k*-th iteration, and ∇*f<sub>i<sub>k</sub></sub>*(*w*<sub>*k*−1</sub>) denotes the stochastic gradient computed at *w*<sub>*k*−1</sub>.
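As a minimal sketch of the iteration in Equation (2), consider a toy least-squares loss *f<sub>i</sub>*(*w*) = ½(*x<sub>i</sub>w* − *y<sub>i</sub>*)² with a scalar parameter. The data, the function name `sgd` and the hyper-parameter values are illustrative assumptions and are not taken from the experiments in this paper.

```python
import random

def sgd(xs, ys, w0=0.0, eta=0.05, steps=2000, seed=0):
    """Run w_k = w_{k-1} - eta * grad f_{i_k}(w_{k-1}) for f_i(w) = 0.5 * (x_i * w - y_i)**2."""
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        i = rng.randrange(len(xs))          # sample the instance index i_k
        grad = (xs[i] * w - ys[i]) * xs[i]  # stochastic gradient at w_{k-1}
        w -= eta * grad                     # step in the negative gradient direction
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated by y = 2x, so the minimiser is w = 2
w_final = sgd(xs, ys)
```

Every coordinate of the update uses the same scalar `eta`, which is the uniform scaling that the adaptive methods discussed later replace.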

SGD has two main drawbacks. First, it needs an appropriate learning rate: an excessively large learning rate prevents the loss function from converging to the optimal value, while an exceptionally small one slows down convergence. Second, SGD scales the gradient uniformly in all directions, so ill-scaled or sparse problems cannot be solved well [9].

To train DNNs, SGD uses a standard decreasing learning rate scheme, where the learning rate is initialized to a large value and decreases gradually over the iterations. However, a suitable initial learning rate is difficult to tune. Line search [10] and grid search are often used to find the optimal learning rate, but their computational overhead is high. The cyclical learning rates method [11] changes the learning rate periodically within a fixed bound, which practically eliminates the need to find the best values and schedule for the global learning rate experimentally. A super-convergence method [12] was later proposed to train networks with one learning rate cycle and a large maximum learning rate, which improves performance compared with standard methods. However, the uniformly scaled gradient still makes these methods perform poorly when the data set is sparse or ill-scaled.

In recent years, a series of adaptive gradient methods have been proposed. These methods scale the gradient by some form of squared past gradients, which achieves a rapid training speed through an element-wise scaling term on the learning rate [13]. Adagrad [9] was the first popular algorithm to use an adaptive gradient, and it performs clearly better than SGD when the gradients are sparse. However, the learning rate of Adagrad drops rapidly because squared gradients accumulate in the denominator, which may degrade performance when the loss functions are non-convex or the gradients are dense. Adadelta [14], RMSprop [15], Adam [16] and Nadam [17] were then proposed to fix this issue; they use exponential moving averages of squared past gradients to avoid the rapid drop of the learning rate. These algorithms have been successfully applied to a variety of practical problems. Adam in particular has become the default algorithm for training neural networks.

When training DNNs with adaptive gradient methods, the loss function decreases rapidly in the early stage of training, but the final training loss and test loss are worse than those of SGD in many applications. Moreover, since the learning rate of Adam does not decrease monotonically, the training process diverges in some applications [18]. Some works have proposed hybrid schemes of Adam and SGD to solve these problems. SWATS [19] switches from Adam to SGD when a triggering condition is satisfied, which can close the generalization gap between Adam and SGD. ADABOUND [13] achieves a gradual and smooth transition from adaptive methods to SGD by employing dynamic bounds on learning rates. For these hybrid algorithms, the time of switching from Adam to SGD and the learning rate of SGD after switching still have a great impact on performance and must be tuned carefully.

In this paper, we study the adaptive gradient algorithms and propose a bounded scheduling method for Adam, called Bsadam, to improve the performance when training neural networks. The major contributions of this paper include:


4. We evaluate the algorithm on multiple tasks and models: simple neural networks are trained on MNIST [20]; ResNet [22], DenseNet [23] and SENet [24] are trained on CIFAR-10 [21] and CIFAR-100 [21]; and LSTM [26] is trained on Penn Treebank [25]. All these experiments show that our method is capable of eliminating the generalization gap between Adam and SGD while maintaining a higher convergence speed in training.

The rest of our paper is organized as follows. In Section 2, the background of this paper is reviewed, where the traditional learning rate methods and adaptive gradient methods are described. In Section 3, we introduce the bounded scheduling scheme for Adam. In Section 4, we present a series of experiments to verify the effectiveness of our method. In Section 5, we summarize the paper.

#### **2. Background**

#### *2.1. Traditional Learning Rate Methods*

The learning rate is one of the most important hyper-parameters of gradient-based optimization methods, and there is much related work on it. Line search [10] is often used to find the learning rate for full-gradient methods. It starts from a large initial learning rate and tries a candidate at each iteration: if the loss does not fall by a sufficient amount relative to its current value, the learning rate is decreased proportionally and tried again, until a learning rate satisfying the decrease condition is found. Line search requires a large amount of computation and is mostly used when the data set is small. A line search method for SGD has also been proposed [27]. It uses random samples to perform a basic line search and to estimate the Lipschitz constant *L*, and then derives the theoretically optimal learning rate from *L*. However, the theoretically optimal learning rate differs from the one that works best in practice, and this method cannot guarantee convergence.
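The shrink-and-retry procedure described above can be sketched as a backtracking line search. The sufficient-decrease test used here is the standard Armijo condition, and the constants `eta0`, `shrink` and `c`, as well as the toy quadratic, are assumptions chosen for illustration rather than values from [10].

```python
def backtracking_lr(f, grad, w, eta0=1.0, shrink=0.5, c=1e-4, max_tries=50):
    """Return a learning rate eta with f(w - eta*g) <= f(w) - c*eta*||g||^2."""
    g = grad(w)
    g_norm_sq = sum(gi * gi for gi in g)
    fw = f(w)
    eta = eta0                                  # start from a large learning rate
    for _ in range(max_tries):
        trial = [wi - eta * gi for wi, gi in zip(w, g)]
        if f(trial) <= fw - c * eta * g_norm_sq:
            return eta                          # the loss fell far enough
        eta *= shrink                           # otherwise decrease proportionally
    return eta

# Ill-scaled quadratic f(w) = 0.5 * (w0^2 + 100 * w1^2)
f = lambda w: 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)
grad = lambda w: [w[0], 100.0 * w[1]]
w = [1.0, 1.0]
eta = backtracking_lr(f, grad, w)
w_next = [wi - eta * gi for wi, gi in zip(w, grad(w))]
```

Each rejected candidate costs one extra evaluation of the loss, which illustrates why the procedure becomes expensive when the loss is computed over a large data set.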

The Barzilai-Borwein method [28,29] is also often used to estimate the learning rate. It is based on the quasi-Newton method and approximates second-order (curvature) information from the differences of successive iterates and gradients, which requires little extra computational overhead. However, the learning rate estimated by the Barzilai-Borwein method may lead to divergence of the training process. Yann Ollivier et al. proposed a method that views the overall performance of the learning trajectory as a function of the learning rate and then adapts the learning rate by performing gradient descent on the learning rate itself [30]. Although these methods do not need to search for the learning rate, their performance is not as good as that of a manually tuned optimal learning rate. The cyclical learning rate method [11] does not use a single learning rate but lets the learning rate vary periodically within a certain range. Super-convergence [12] was then proposed to train DNNs with one cycle and a large maximum learning rate, which provides a boost in performance. Traditional learning rate methods scale the gradient uniformly in all directions, so their performance degrades when data sets are sparse or ill-scaled.

#### *2.2. Adaptive Gradient Methods*

The recently proposed adaptive gradient methods provide an element-wise scaling term on the learning rate without the need to tune the learning rate manually. These methods use historical gradient information to estimate the curvature of the loss function and adopt a different learning rate for each parameter, so the learning rate is a vector with one element per parameter, unlike in the traditional learning rate methods. Representative adaptive gradient methods include Adagrad [9], RMSprop [15], Adam [16] and AMSgrad [18].

Adagrad [9] is the first proposed adaptive gradient method. Its main idea is to adopt a smaller learning rate for the parameters corresponding to frequent features and a larger learning rate for the parameters corresponding to infrequent features. Therefore, Adagrad is very suitable for training sparse data, which can improve the robustness of SGD. The update of Adagrad can be shown as follows:

$$w\_k = w\_{k-1} - \eta \frac{\nabla f(w\_{k-1})}{\sqrt{v\_k} + \epsilon},\tag{3}$$

where

$$v\_k = \sum\_{j=0}^{k-1} \nabla f(w\_j)^2, \tag{4}$$

*ε* is a smoothing term that avoids division by zero, and *η* is the global learning rate.

Adagrad accumulates the squared gradients, which are non-negative, so the learning rate declines rapidly towards zero and the loss function stalls. RMSprop [15] was proposed to solve this rapid decay of the learning rate in Adagrad. The update rule of RMSprop is the same as (3), but *v<sub>k</sub>* is an exponentially decaying average of squared gradients:

$$v\_k = \beta v\_{k-1} + (1 - \beta) \nabla f(w\_{k-1})^2,\tag{5}$$

where *β* ∈ [0, 1) is the hyper-parameter that controls the exponential decay rate of the average. Using the exponential moving average of squared past gradients prevents rapid growth of *v<sub>k</sub>*, so the learning rate does not decline rapidly.
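The difference between the two accumulators can be seen in a small sketch that tracks the effective step *η*/(√*v<sub>k</sub>* + *ε*) of Equation (3) for a scalar parameter under a constant gradient. The function names and the values of `eta`, `beta` and `eps` are assumptions chosen for illustration.

```python
import math

def adagrad_step_sizes(grads, eta=0.1, eps=1e-8):
    """Effective step eta / (sqrt(v_k) + eps), with v_k the running sum of squared gradients."""
    v, out = 0.0, []
    for g in grads:
        v += g * g                              # accumulation as in Eq. (4)
        out.append(eta / (math.sqrt(v) + eps))
    return out

def rmsprop_step_sizes(grads, eta=0.1, beta=0.9, eps=1e-8):
    """Same step, with v_k an exponential moving average of squared gradients."""
    v, out = 0.0, []
    for g in grads:
        v = beta * v + (1.0 - beta) * g * g     # moving average as in Eq. (5)
        out.append(eta / (math.sqrt(v) + eps))
    return out

grads = [1.0] * 1000    # constant gradient of magnitude 1
ada = adagrad_step_sizes(grads)
rms = rmsprop_step_sizes(grads)
```

With this input the Adagrad step keeps shrinking in proportion to 1/√*k*, whereas the RMSprop step settles near `eta`, because the moving average converges to the squared gradient magnitude.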

Adam [16] can also calculate adaptive learning rate for each parameter. As a complement to RMSprop, Adam preserves the exponential moving averages of squared past gradients, as well as the exponential moving averages of past gradients, which gives the gradient momentum. The update formula of Adam is shown as follows:

$$w\_k = w\_{k-1} - \eta \cdot \frac{\sqrt{1 - \beta\_2^k}}{1 - \beta\_1^k} \cdot \frac{m\_k}{\sqrt{v\_k} + \epsilon},\tag{6}$$

where

$$\begin{aligned} m\_k &= \beta\_1 m\_{k-1} + (1 - \beta\_1) \nabla f(w\_{k-1}), \\ v\_k &= \beta\_2 v\_{k-1} + (1 - \beta\_2) \nabla f(w\_{k-1})^2, \end{aligned} \tag{7}$$

where *β*<sub>1</sub>, *β*<sub>2</sub> ∈ [0, 1) are hyper-parameters that control the exponential decay rates of the moving averages.
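Equations (6) and (7) translate into a few lines of code for a scalar parameter. This is a minimal sketch: the function name `adam_step`, the toy objective and the value `eta=0.01` used in the loop are assumptions for illustration.

```python
import math

def adam_step(w, g, m, v, k, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; k is the 1-based iteration counter."""
    m = beta1 * m + (1.0 - beta1) * g            # moving average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g        # moving average of squared gradients
    scale = math.sqrt(1.0 - beta2 ** k) / (1.0 - beta1 ** k)   # bias-correction factor
    w = w - eta * scale * m / (math.sqrt(v) + eps)
    return w, m, v

# Minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
w_first, _, _ = adam_step(w, 2.0 * (w - 3.0), m, v, 1, eta=0.01)
for k in range(1, 5001):
    w, m, v = adam_step(w, 2.0 * (w - 3.0), m, v, k, eta=0.01)
```

On the first iteration the bias-correction factor cancels the (1 − *β*) terms, so the parameter moves by approximately `eta` regardless of the gradient magnitude.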

Reddi et al. pointed out that the use of exponential moving averages of squared past gradients may make Adam fail to converge to the optimal solution. As a result, AMSGrad was proposed [18]. Unlike Adam, AMSGrad uses the maximum of the exponential moving averages of squared past gradients. The update rule of *v<sub>k</sub>* is as follows:

$$\begin{aligned} \hat{v}\_k &= \beta\_2 \hat{v}\_{k-1} + (1 - \beta\_2) \nabla f(w\_{k-1})^2, \\ v\_k &= \max(\hat{v}\_k, v\_{k-1}). \end{aligned} \tag{8}$$
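The effect of the maximum in Equation (8) can be illustrated for scalar gradients. The sketch follows the notation used here, in which the hatted quantity is the moving average and *v<sub>k</sub>* is its running maximum; the function name and the gradient sequence are illustrative assumptions.

```python
def amsgrad_second_moment(grads, beta2=0.999):
    """Return the sequence v_k = max(v_hat_k, v_{k-1}) for scalar gradients."""
    v_hat, v, history = 0.0, 0.0, []
    for g in grads:
        v_hat = beta2 * v_hat + (1.0 - beta2) * g * g   # exponential moving average
        v = max(v_hat, v)                               # the denominator never shrinks
        history.append(v)
    return history

# One large gradient followed by many small ones
history = amsgrad_second_moment([10.0] + [0.1] * 200, beta2=0.9)
```

Because *v<sub>k</sub>* is non-decreasing, the effective learning rate cannot grow again after a large gradient has been observed, which is the property that Adam lacks.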

Adaptive gradient methods generalize poorly when training complex models, and their performance is worse than that obtained with a manually tuned optimal learning rate.

#### **3. Bsadam: Bounded Scheduling Method for Adam**

#### *3.1. Preliminaries*

First, we use an empirical study to illustrate the existence of the generalization gap in Adam. We use SGD and Adam to perform image classification on the CIFAR-10 data set with the ResNet-34 architecture and present the training accuracy and test accuracy in Figure 1. As can be seen from Figure 1, the training and test accuracy of Adam both increase faster than those of SGD in the early stage. However, when the learning rate is reduced by a factor of 10 after 100 epochs, the training and test accuracy of Adam are lower than those of SGD. Although the final training accuracy of Adam reaches the level of SGD, its test accuracy is still 1% to 2% lower than that of SGD, which means that its generalization gap is larger than that of SGD.

Various factors may lead to the weak empirical generalization capability of Adam. Based on previous research [13,19,31–33], we summarize these factors and work to eliminate them. The main factors are listed as follows.

**Figure 1.** Training the ResNet-34 architecture on the CIFAR-10 data set with stochastic gradient descent (SGD) and Adam. Adam has a faster initial convergence speed, but the final test accuracy is lower than that of SGD.


Taking all these factors into account, some improvements need to be considered for Adam. Upper and lower bounds should be specified to avoid the side effects caused by extremely large and extremely small learning rates. In the later stage of training, the learning rate should decrease monotonically to ensure convergence and should be uniformly scaled to improve generalization performance.

#### *3.2. Specify Bounds for Adam*

In this paper, we use the loss curve obtained by the learning rate range test (LR range test) [11] to determine the upper and lower bounds of the learning rate for Adam. When training a new model or data set, the LR range test is a very effective way to find a reasonable learning rate range for SGD, although it cannot find a specific learning rate. The LR range test uses SGD to train the model for several epochs while the learning rate increases linearly from small to large; the approximate range of reasonable learning rates can then be estimated from the loss curve. Specifically, when the loss decreases, the current learning rate is reasonable; when the loss rises, the current learning rate is inappropriate.

However, because Adam itself adjusts the learning rate, the criterion for specifying its bounds differs from the classical LR range test: we need a wider range of bounds. Specifically, the lower bound can be set to the point where the loss function begins to decline and the upper bound can be set to the point where the loss function begins to rise. Moreover, to obtain better generalization ability, the upper bound can be enlarged by up to five times.

For example, we use the ResNet-34 architecture to perform the LR range test on CIFAR-10 and obtain the curve of the loss function against the learning rate. The result is shown in Figure 2. As can be seen from Figure 2, the loss begins to decline noticeably when the learning rate is 0.001, so 0.001 can be used as the lower bound of the learning rate for Adam. When the learning rate is 0.1, the loss starts to rise and the training process starts to diverge, so 0.1 can be used as the upper bound of the learning rate for Adam. However, through experiments, we find that the algorithm can reach better minima when the upper bound is increased appropriately, so the upper bound can be set to 0.5. Once the upper and lower bounds of the learning rate are limited in this way, the negative effects of too large or too small learning rates on Adam can be eliminated.
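The rule for reading the bounds off a range-test curve can be sketched as follows. In this paper the bounds are read from the plotted curve, so the automated criteria below (a relative drop `drop_tol` for the start of the decline and the loss minimum for the start of the rise) are assumptions, and the loss values are synthetic numbers shaped like the description of Figure 2, not measured data.

```python
def bounds_from_range_test(lrs, losses, drop_tol=0.05, enlarge=5.0):
    """Pick (min_lr, max_lr) from a learning rate range-test curve.

    min_lr: first learning rate whose loss is at least `drop_tol` (relative)
            below the initial loss, i.e. where the loss starts to decline.
    max_lr: learning rate at the minimum loss, after which the loss rises,
            multiplied by `enlarge` (the text allows up to five times).
    """
    start = losses[0]
    min_lr = next(lr for lr, loss in zip(lrs, losses) if loss <= start * (1.0 - drop_tol))
    best = min(range(len(losses)), key=losses.__getitem__)
    return min_lr, lrs[best] * enlarge

# Synthetic curve (made-up values): flat, then declining, then diverging
lrs    = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0]
losses = [2.30, 2.29, 2.10, 1.80, 1.50, 1.30, 1.20, 1.90, 4.00]
min_lr, max_lr = bounds_from_range_test(lrs, losses)
```

For this synthetic curve the sketch returns a lower bound of 0.001 and an enlarged upper bound of 0.5, matching the values chosen in the example above.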

**Figure 2.** Learning rate range test. The x-axis is learning rate (log scale), the y-axis is training loss.

#### *3.3. Schedule Bounds for Adam*

We improve the empirical generalization capability of Adam by scheduling its lower and upper bounds, which reduces the adverse effects of the non-uniform scaling of the gradients and of the non-monotonically decreasing learning rate. According to the update formula of Adam, we can regard $\frac{\sqrt{1-\beta\_2^k}}{1-\beta\_1^k}\cdot\frac{\eta}{\sqrt{v\_k}+\epsilon}$ as the learning rate of Adam and *m<sub>k</sub>* as its gradient with momentum. Gradient clipping constrains a quantity to a certain range and is an effective remedy for vanishing or exploding gradients. We use the same clipping operation to clip the learning rate of Adam whenever it exceeds a threshold. Consider applying the following operation to Adam:

$$\text{Clip}(\frac{\sqrt{1-\beta\_2^k}}{1-\beta\_1^k}\cdot\frac{\eta}{\sqrt{v\_k}+\epsilon}, \text{min\\_lr}, \text{max\\_lr}),\tag{9}$$

which clips the learning rate of Adam element-wise so that each element of the learning rate is limited to the range [*min*\_*lr*, *max*\_*lr*], where *min*\_*lr* and *max*\_*lr* are the lower and upper bounds found in Section 3.2, respectively.
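A minimal sketch of the clipping operation in Equation (9), written for a plain list of second-moment estimates, is given below. The function name `clipped_adam_lr` and the numerical values are assumptions for illustration.

```python
import math

def clipped_adam_lr(v, k, eta, min_lr, max_lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Element-wise Adam learning rate, clipped to [min_lr, max_lr]."""
    scale = math.sqrt(1.0 - beta2 ** k) / (1.0 - beta1 ** k)
    raw = [scale * eta / (math.sqrt(vi) + eps) for vi in v]   # unclipped learning rates
    return [min(max(r, min_lr), max_lr) for r in raw]         # Clip(raw, min_lr, max_lr)

# Three coordinates whose past gradients differ by orders of magnitude
lr = clipped_adam_lr(v=[1e-12, 1e-2, 1e4], k=1000, eta=0.01, min_lr=0.001, max_lr=0.5)
```

In this example the first coordinate would receive an extremely large step and is capped at `max_lr`, the third would receive a vanishing step and is raised to `min_lr`, and the second is left unchanged.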

Then we will schedule the bounds of learning rate. The scheduling process is divided into three phases, which are finding minima, converging and uniform scaling. The scheduling details for each phase are described in detail below.

#### 3.3.1. Finding Minima

In this phase, we use the concept of super-convergence, which implies that a large maximum learning rate can achieve better generalization capability. Using a relatively large learning rate in the early stage of training can make the loss function skip the suboptimal solution more easily and find wide, flat minima. Therefore, we fix the upper bound of learning rate and gradually increase the lower bound of learning rate, so that each element of learning rate can gradually rise to the upper bound. In this phase, gradient clipping can be expressed as follows:

$$\text{Clip}(\frac{\sqrt{1-\beta\_2^k}}{1-\beta\_1^k}\cdot\frac{\eta}{\sqrt{v\_k}+\epsilon}, \text{ascending}(t), \text{max\\_lr}), \tag{10}$$

where *ascending*(*t*) is a function that raises the lower bound gradually from *min*\_*lr* to *max*\_*lr* and *t* is the current epoch within this phase. *ascending*(*t*) can be linear, exponential or trigonometric, formulated as follows:

• linear rise:

$$\text{ascending}(t) = \min\\_lr + t \cdot \frac{\max\\_lr - \min\\_lr}{T},\tag{11}$$

• exponential rise:

$$\text{ascending}(t) = \min\\_lr \cdot \left(\frac{\max\\_lr}{\min\\_lr}\right)^{t/T},\tag{12}$$

• trigonometric rise:

$$\text{ascending}(t) = \min\\_lr + (\max\\_lr - \min\\_lr)\sin(\frac{t}{T}\cdot\frac{\pi}{2}),\tag{13}$$

where *T* is the total epochs in this phase.

#### 3.3.2. Converging

In this phase, to avoid the divergence or poor generalization performance caused by the non-monotonic decline of learning rate, we need to make sure that the learning rate of Adam is decreasing. Therefore, we fix the lower bound of learning rate and gradually decrease the upper bound of learning rate, so that each element of learning rate can gradually decrease to the lower bound. In this phase, gradient clipping can be expressed as follows:

$$\text{Clip}(\frac{\sqrt{1-\beta\_2^k}}{1-\beta\_1^k}\cdot\frac{\eta}{\sqrt{v\_k}+\epsilon}, \min\\_lr, \text{descending}(t)),\tag{14}$$

where *descending*(*t*) is a function that lowers the upper bound gradually from *max*\_*lr* to *min*\_*lr* and *t* is the current epoch within this phase. *descending*(*t*) can be linear, exponential or trigonometric, formulated as follows:

• linear decrease:

$$\text{descending}(t) = \max\\_lr - t \cdot \frac{\max\\_lr - \min\\_lr}{T},\tag{15}$$

• exponential decrease:

$$\text{descending}(t) = \max\\_lr \cdot \left(\frac{\min\\_lr}{\max\\_lr}\right)^{t/T},\tag{16}$$

• trigonometric decrease:

$$\text{descending}(t) = \max\\_lr - (\max\\_lr - \min\\_lr)\sin(\frac{t}{T}\cdot \frac{\pi}{2}),\tag{17}$$

where *T* is the total epochs in this phase.
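The rise schedules of Section 3.3.1 and the decrease schedules above can be sketched together. The exponent of the exponential variants is taken to be *t*/*T*, which is the choice that makes the bound start at one limit and end at the other; the function signatures and the `kind` argument are assumptions for illustration.

```python
import math

def ascending(t, T, min_lr, max_lr, kind="linear"):
    """Lower bound at epoch t in [0, T], rising from min_lr to max_lr."""
    frac = t / T
    if kind == "linear":
        return min_lr + frac * (max_lr - min_lr)
    if kind == "exponential":
        return min_lr * (max_lr / min_lr) ** frac
    if kind == "trigonometric":
        return min_lr + (max_lr - min_lr) * math.sin(frac * math.pi / 2.0)
    raise ValueError(f"unknown schedule: {kind}")

def descending(t, T, min_lr, max_lr, kind="linear"):
    """Upper bound at epoch t in [0, T], falling from max_lr to min_lr."""
    frac = t / T
    if kind == "linear":
        return max_lr - frac * (max_lr - min_lr)
    if kind == "exponential":
        return max_lr * (min_lr / max_lr) ** frac
    if kind == "trigonometric":
        return max_lr - (max_lr - min_lr) * math.sin(frac * math.pi / 2.0)
    raise ValueError(f"unknown schedule: {kind}")

# Bounds over a 40-epoch phase with the limits from the CIFAR-10 example
lower = [ascending(t, 40, 0.001, 0.5, "trigonometric") for t in range(41)]
upper = [descending(t, 40, 0.001, 0.5, "exponential") for t in range(41)]
```

The three variants share the same end points and differ only in how quickly the bound moves: the trigonometric form changes fastest at the start of the phase, while the exponential form changes the bound by a constant ratio per epoch.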

#### 3.3.3. Uniform Scaling

A conventional step in training neural networks is to reduce the learning rate by a factor of 10 in the final stage of training, so that the algorithm converges close to a minimum. In our algorithm, at the end of the converging phase the upper bound has been reduced to *min*\_*lr*, so the gradients are uniformly scaled. We continue training the model with *min*\_*lr* as the learning rate. The training accuracy and test accuracy are further improved and stabilized, and the algorithm eventually converges. Because the gradients are uniformly scaled in this phase, generalization performance improves.

### *3.4. Algorithm Overview*

Based on the above analysis, in this subsection we propose a new variant of the optimization methods, named Bsadam, which maintains the fast convergence of the algorithm in the early stage and obtains a good final generalization capability.

Empirically, the number of epochs in the first phase is the same as that in the second phase, and the number of epochs in the third phase should be smaller than that in the former two phases. Specifically, if the total number of training epochs is *T*, the three phases are allocated 2*T*/5, 2*T*/5 and *T*/5 epochs, respectively. The details of Bsadam are given in Algorithm 1, where *max*\_*lr* and *min*\_*lr* can be found by the method described in Section 3.2, *β*<sub>1</sub>, *β*<sub>2</sub> and *η* are the hyper-parameters of Adam itself, and *data*\_*loader*() is a function that combines a data set and a sampler and provides an iterable over the given data set.

#### **Algorithm 1** Bsadam.

```
Parameters: total epochs T, max_lr, min_lr
Initialize: w_0, β1, β2, η
set v_0 = 0, m_0 = 0, k = 0
# Phase 1: finding minima (the lower bound rises)
for t = 1, 2, ..., 2T/5 do
    ascending_lr = ascending(t)
    for data in data_loader() do
        k = k + 1
        compute gradient g_k = ∇f(w_{k−1}) on data
        m_k = β1 · m_{k−1} + (1 − β1) · g_k
        v_k = β2 · v_{k−1} + (1 − β2) · g_k²
        lr = Clip(√(1 − β2^k) / (1 − β1^k) · η / (√v_k + ε), ascending_lr, max_lr)
        w_k = w_{k−1} − lr · m_k
    end for
end for
# Phase 2: converging (the upper bound falls)
for t = 1, 2, ..., 2T/5 do
    descending_lr = descending(t)
    for data in data_loader() do
        k = k + 1
        compute gradient g_k = ∇f(w_{k−1}) on data
        m_k = β1 · m_{k−1} + (1 − β1) · g_k
        v_k = β2 · v_{k−1} + (1 − β2) · g_k²
        lr = Clip(√(1 − β2^k) / (1 − β1^k) · η / (√v_k + ε), min_lr, descending_lr)
        w_k = w_{k−1} − lr · m_k
    end for
end for
# Phase 3: uniform scaling (fixed learning rate min_lr)
for t = 1, 2, ..., T/5 do
    lr = min_lr
    for data in data_loader() do
        k = k + 1
        compute gradient g_k = ∇f(w_{k−1}) on data
        w_k = w_{k−1} − lr · g_k
    end for
end for
```
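
Algorithm 1 can be sketched in plain Python on a toy problem. This is not the PyTorch implementation used in the experiments: the function `bsadam`, the callable `grad_fn` standing in for a minibatch gradient, the `steps_per_epoch` argument, the linear bound schedules and the integer division used for the phase lengths are all assumptions made for illustration.

```python
import math

def clip(x, lo, hi):
    """Limit x to the interval [lo, hi]."""
    return min(max(x, lo), hi)

def bsadam(grad_fn, w, total_epochs, min_lr, max_lr, steps_per_epoch=10,
           eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Toy Bsadam loop with linear bound schedules on a list of parameters."""
    phase = 2 * total_epochs // 5            # length of phases 1 and 2
    span = max_lr - min_lr
    m = [0.0] * len(w)
    v = [0.0] * len(w)
    k = 0
    for epoch in range(1, total_epochs + 1):
        if epoch <= phase:                   # phase 1: the lower bound rises
            lo, hi = min_lr + epoch * span / phase, max_lr
        elif epoch <= 2 * phase:             # phase 2: the upper bound falls
            lo, hi = min_lr, max_lr - (epoch - phase) * span / phase
        else:                                # phase 3: uniform scaling
            lo = hi = None
        for _ in range(steps_per_epoch):
            k += 1
            g = grad_fn(w)
            if lo is None:                   # plain gradient step with min_lr
                w = [wi - min_lr * gi for wi, gi in zip(w, g)]
                continue
            m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
            v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
            scale = math.sqrt(1 - beta2 ** k) / (1 - beta1 ** k)
            lr = [clip(scale * eta / (math.sqrt(vi) + eps), lo, hi) for vi in v]
            w = [wi - li * mi for wi, li, mi in zip(w, lr, m)]
    return w

# Ill-scaled quadratic f(w) = 0.5 * (w0^2 + 10 * w1^2); its gradient is [w0, 10 * w1]
grad_fn = lambda w: [w[0], 10.0 * w[1]]
loss = lambda w: 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)
w_final = bsadam(grad_fn, [1.0, 1.0], total_epochs=50, min_lr=0.001, max_lr=0.1)
```

The bounds used in this toy run are deliberately small relative to the curvature of the quadratic; for a real model they would come from the LR range test of Section 3.2.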
## **4. Experiments**

To illustrate the effectiveness of our algorithm, we experimented with different models on different data sets to compare the new variant with other popular optimization methods, such as SGD with momentum (SGDM), Adagrad and Adam. We mainly consider two problems that are often solved by deep neural networks: image classification and language modeling. The models used in the experiments include a feedforward neural network, a convolutional neural network [34], deep convolutional neural networks and a recurrent neural network. The data sets used in the experiments are MNIST [20], CIFAR-10 [21], CIFAR-100 [21] and Penn Treebank [25]. All these models and data sets are often encountered in deep learning.

### *4.1. Experimental Setup*

We ran these experiments on a server with 2 NVIDIA TITAN Xp GPUs, 1 Intel i7-6800K CPU, 8 × 16 GB DDR4 memory, a 240 GB SSD and a 1 TB SATA disk. The experiments were coded in PyTorch. Each experiment was run three times and we chose the best run.

The algorithms under consideration have many hyper-parameters, and the setting of these hyper-parameters has a great influence on the performance of the optimization algorithm. Here we describe how we adjust them. We use a logarithmic grid search over a large space of learning rates and then fine-tune the result. The resulting values are shown in Table 1. Specifically, the learning rate of each algorithm is adjusted as follows:


Other hyper-parameters such as batch size and weight decay use the default values recommended by the model.

| Data Set | Model | Network Type | SGD(M) | Adagrad | Adam | Bsadam |
|---|---|---|---|---|---|---|
| MNIST | 1-Layer Perceptron | Feedforward | 0.1 | 0.001 | 0.001 | (0.01, 0.5) |
| MNIST | 1-Layer Convolutional | Convolutional | 0.1 | 0.001 | 0.001 | (0.01, 0.5) |
| CIFAR-10 | ResNet | Deep Convolutional | 0.1 | 0.001 | 0.001 | (0.01, 0.5) |
| CIFAR-10 | DenseNet | Deep Convolutional | 0.1 | 0.001 | 0.001 | (0.01, 0.5) |
| CIFAR-10 | SENet | Deep Convolutional | 0.1 | 0.001 | 0.001 | (0.01, 0.5) |
| CIFAR-100 | ResNet | Deep Convolutional | 0.3 | 0.001 | 0.001 | (0.05, 1) |
| CIFAR-100 | DenseNet | Deep Convolutional | 0.1 | 0.001 | 0.001 | (0.05, 1) |
| CIFAR-100 | SENet | Deep Convolutional | 0.1 | 0.001 | 0.001 | (0.01, 0.5) |
| Penn Treebank | 1-Layer LSTM | Recurrent | 50 | 0.001 | 0.001 | (5, 100) |
| Penn Treebank | 2-Layer LSTM | Recurrent | 50 | 0.001 | 0.001 | (5, 100) |

**Table 1.** Summarizing the models and the data sets utilized for our experiments. The optimal hyperparameters for stochastic gradient descent (SGD) with momentum (M), Adagrad, Adam and Bsadam for all experiments are also listed.

#### *4.2. Image Classification*

#### 4.2.1. Simple Neural Network

The MNIST database of handwritten digits has a training set of 60,000 images and a test set of 10,000 images, divided into 10 classes. For the image classification problem on the MNIST data set, we train a simple fully connected neural network with one hidden layer and a one-layer convolutional network with one convolutional layer and one fully connected layer. For the fully connected neural network, we run 100 epochs and decay the learning rate by a factor of 10 at the 80th epoch. For the convolutional network, we run 75 epochs and decay the learning rate by a factor of 10 at the 60th epoch.

Figure 3 shows the learning curve of each optimization method, which includes training accuracy and test accuracy. All the optimization algorithms achieve nearly 100% accuracy on the training set, but their accuracy differs on the test set. Among these algorithms, Adagrad converges fastest on the training set but achieves lower accuracy on the test set, and SGDM has slightly better test accuracy than Adam and Adagrad. Our proposed Bsadam converges faster than SGDM in the early stage. In the converging phase in particular, Bsadam converges faster than all the compared algorithms on both the training and test sets. Moreover, the final test accuracy of Bsadam is as good as that of fine-tuned SGDM, which means that our algorithm trains faster without sacrificing accuracy on simple neural networks. We also ran RMSProp and Nesterov with default settings on MNIST. RMSProp has the worst convergence speed and test accuracy, and Nesterov performs similarly to SGD with momentum, so our method retains its advantage over these methods as well.

(**a**) Training accuracy for fully connected neural network (**b**) Test accuracy for fully connected neural network

(**c**) Training accuracy for one-layer convolutional neural network (**d**) Test accuracy for one-layer convolutional neural network

**Figure 3.** Training and test accuracy for fully connected neural network and one-layer convolutional neural network on MNIST.

#### 4.2.2. Deep Convolutional Network

We also evaluate our algorithm on more complex deep convolutional networks. Specifically, we perform experiments with three architectures, ResNet-34 [22], DenseNet-121 [23] and SENet-34 [24], on the CIFAR-10 and CIFAR-100 data sets for image classification tasks. Each data set has a training set of 50,000 32 × 32 RGB images and a test set of 10,000 images, divided into 10 classes for CIFAR-10 and 100 classes for CIFAR-100. We run 125 epochs for all the compared algorithms and decay the learning rate by a factor of 10 at the 100th epoch.

Figure 4 shows the learning curve of each optimization method running on CIFAR-10, which includes training accuracy and test accuracy. Although Adagrad converges faster and reaches higher accuracy on the training set, its accuracy is the lowest on the test set, which indicates that its generalization gap is relatively large. Adam converges faster than SGDM in early training, but its final test accuracy is lower than that of SGDM. SGDM has the slowest convergence speed on both the training set and the test set, but its final test accuracy is higher than that of Adam and Adagrad, which means that it generalizes better than the adaptive gradient methods. Our proposed Bsadam converges faster than fine-tuned SGDM in early training. In the converging phase in particular, Bsadam converges faster than all the compared algorithms on both the training and test sets. The final training and test accuracy of Bsadam are the highest among all the compared algorithms, which indicates that our algorithm can accelerate the training process and improve the accuracy of complex deep neural networks.

Figure 5 shows the learning curve of each optimization method running on CIFAR-100, which includes training accuracy and test accuracy. The results are similar to those in Figure 4: the adaptive gradient methods often exhibit a relatively large generalization gap, whereas Bsadam achieves faster convergence and higher final accuracy on both the training and test sets.


**Figure 4.** Training and test accuracy for ResNet-34, SENet-34 and DenseNet-121 on CIFAR-10.

**Figure 5.** Training and test accuracy for ResNet-34, SENet-34 and DenseNet-121 on CIFAR-100.

#### *4.3. Language Modeling*

To illustrate the wide applicability of our algorithm, we also conduct experiments with recurrent networks. Specifically, we perform experiments with a long short-term memory (LSTM) network [26] on the Penn Treebank data set for word-level language modeling tasks. We compare our algorithm with Adam and SGD without momentum. We run 125 epochs for all the compared algorithms and decay the learning rate by a factor of 10 at the 100th epoch.

Figure 6 shows the learning curve of each optimization method running on Penn Treebank, which includes training and validation perplexity. The training perplexity of the two-layer LSTM is lower than that of the one-layer LSTM, but the validation perplexity is almost the same, which indicates that the complexity of the network may weaken the generalization capability of the algorithm. Although Adam achieves a lower perplexity on the training set, its final perplexity on the validation set is relatively high. SGD converges slowly in the early stage on the validation set, but its final perplexity is lower than that of Adam. Bsadam converges slowly in the finding minima phase, but in the converging phase both training and validation perplexity decrease rapidly, and the overall convergence speed is faster than that of SGD. Moreover, Bsadam reaches a final perplexity similar to or better than that of fine-tuned SGD.
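For readers unfamiliar with the metric, perplexity is conventionally defined as the exponential of the mean negative log-probability that the model assigns to the correct next word, so lower is better. The sketch below uses that standard definition; the text does not spell it out, and the helper name `perplexity` and its sample inputs are illustrative.

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the target tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that always spreads its probability evenly over 4 candidate words
# is exactly as uncertain as a uniform 4-way choice.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# Assigning higher probability to the correct words lowers the perplexity.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
```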

**Figure 6.** Training and validation perplexity for long short-term memory (LSTM) with different layers on Penn Treebank.

#### *4.4. Comparison of Different Scheduling Methods*

In this paper, we propose three bounded scheduling methods: linear scheduling, exponential scheduling and trigonometric scheduling. We use these three methods to train SENet-34 on CIFAR-10; the learning curves are shown in Figure 7. The three methods perform similarly, but the details of their learning curves differ slightly. Exponential scheduling converges fastest of the three, but its final test accuracy is the lowest. Linear scheduling has the highest final test accuracy, but its convergence is the slowest.

**Figure 7.** Training and test accuracy of different scheduling methods for SENet-34 on CIFAR-10.

#### **5. Conclusions**

To address the poor generalization capability of adaptive gradient methods in training deep neural networks, this paper proposes a bounded scheduling method called Bsadam. We first find the upper and lower bounds for Adam, then divide the training process into three phases: the finding minima phase allows the algorithm to escape suboptimal solutions by raising the lower bound of Adam, the converging phase ensures the convergence of the algorithm by decreasing the upper bound of Adam, and the uniform scaling phase allows the algorithm to converge to the minimum. We evaluate our algorithm using simple neural networks, deep convolutional networks and recurrent networks on image classification and language modeling tasks. The experimental results show that our algorithm outperforms SGD(M) and the adaptive gradient methods in convergence speed and accuracy.

**Author Contributions:** Conceptualization, M.T.; methodology, M.T.; software, M.T.; validation, Z.H., Y.Y. and C.W.; formal analysis, M.T. and Z.H.; investigation, M.T.; resources, Y.P.; data curation, M.T.

**Funding:** This research was funded by The National Key Research and Development Program of China grant number 2016YFB1000100.

**Acknowledgments:** All authors thank the referees for their valuable suggestions and help.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article*

## **A Novel Bio-Inspired Method for Early Diagnosis of Breast Cancer through Mammographic Image Analysis**

**David González-Patiño <sup>1</sup> , Yenny Villuendas-Rey <sup>2</sup> , Amadeo-José Argüelles-Cruz 1,\* and Fakhri Karray <sup>3</sup>**


Received: 28 September 2019; Accepted: 17 October 2019; Published: 23 October 2019

**Abstract:** Breast cancer is a current problem that causes the death of many women. In this work, we test meta-heuristics applied to the segmentation of mammographic images. Traditionally, these algorithms are applied directly to optimization problems; in this study, however, they are used to segment mammograms, with the Dunn index as the objective function and grey levels representing each individual. Updating the grey levels during the process maximizes the Dunn index; the higher the index, the better the segmentation. The results showed a lower error rate using these meta-heuristics for segmentation compared to a well-adopted classical approach known as the Otsu method.

**Keywords:** mammogram; meta-heuristics; optimization; breast cancer; segmentation; detection

#### **1. Introduction**

Breast cancer is an acute health problem all over the world and one of the most common cancers that cause death among women. This cancer caused over 1.7 million deaths in 2012, and nearly 5 million cases were diagnosed. Breast cancer cases are estimated to continue growing in developing countries through 2030 [1].

To reduce these cases and to provide treatments, it is necessary to have better ways of diagnosing breast cancer early on. Using algorithms to help in the early diagnosis would be desirable [2]. Such algorithms can take the images produced by mammography studies or other screening techniques and derive an image that is easier to analyze, for example a simplified representation that keeps only the region of interest for later analysis.

Physicians use mammography studies to obtain an X-ray image of the breast, which is then analyzed in detail to detect breast cancer. These evaluations have resulted in a reduction of breast cancer mortality [3]. When cancer is in the early phases, the group of abnormal cells is found in the same region, and hence it can be easily detected using mammography [4]. Experts highly recommend a mammography study every one or two years for women aged 39 to 69 [5,6]. After the mammography image is generated, it must be processed to identify the Region of Interest (ROI) for later analysis. This process is called segmentation.

However, this is not the only way to carry out these studies; as an example, a different way is seen in the work carried out by Yu et al. in 2017 [7]. They present an improvement in the images taken by ultrasound tomography, which improves the segmentation and the later analysis and classification of the image. In 2013, Duric et al. [8] performed a comparison between ultrasound tomographies and digital mammographies showing that both methods are positively associated with the identification of the amount of dense tissue. They performed the study by comparing the volume-averaged sound speed of the breast in ultrasound tomography and mammographic percent density in mammographies.

Segmentation involves splitting a digital image into non-overlapping groups of pixels to make it easy to interpret [9]. This process is useful to find objects and boundaries, and it can be used in many applications such as object detection, recognition, machine vision, and medical images [10]. A perfect segmentation should have uniform and homogeneous characteristics, boundaries should be simple, interiors should not have small holes if possible, and adjacent regions should have different characteristics [11]. Since the previous characteristics are hard to achieve, there are techniques that try to find the best segmentation for digital images, each of them with advantages and disadvantages [10].

According to Fu et al. [12], such techniques can be conveniently summarized into three types: edge detection, region extraction, and clustering segmentation.

Edge detection is a technique based on distinguishing regions with the highest change in grey level. It also detects discontinuities in depth and surface, changes in material properties, and variations in scene illumination and brightness.

Several studies report the progression and implementation of edge detection. Dollár and Zitnick [13] use a segmentation technique with a structured learning framework applied to random decision forests. Malik et al. [14] describe the Canny edge detection method to identify lines in a Finger Knuckle Print, a biometric identifier adopted in recent years.

Qi et al. [15] worked with infrared images and algorithms that transform edges into curves and produce intersections for breast cancer detection. Later on, Mencattini et al. [16] proposed an algorithm to detect micro-calcifications and masses. It also detects the edges using an enhancement procedure after a transformation process.

Song et al. [17] used Fully Convolutional Networks to segment and detect corners in aerial images of buildings. Their work achieved good performance in this task, outperforming several algorithms in their corresponding comparison.

The region extraction approach splits an image into regions using merging or dividing techniques. Fan et al. [18] proposed an image segmentation method that uses edges as the initial elements and incorporates pixels into each region at each iteration. A similar technique was used by Yan et al. in 2003 [19] using local entropy, obtaining good performance with noisy images. Region extraction is convenient because a region comprises more pixels; it therefore performs better on noisy images and provides more information than edge detection techniques [18], which struggle with noise. Combining region extraction and edge detection provides more detailed information about the image [18].

Clustering, or feature thresholding, consists of choosing threshold values that map all the pixels into different clusters. These threshold values can be grey levels, gradients or percentages. Yao et al. [20] used an algorithm based on clustering to segment fish images. Salem [21] reported a similar technique to find white cells using a k-means algorithm. Patel and Sinha [22] used a clustering algorithm based on adaptive k-means, diversifying the parameters to boost segmentation performance. In general, clustering is simple and easy to implement, and the results are easy to interpret.

On one hand, the extraction of regions seeks to identify the region of interest for further analysis. On the other hand, clustering algorithms seek to group regions with similar characteristics regardless of whether they represent something relevant.

The clustering approach to segmentation is the focus of this work.

There are recent works on the detection of breast cancer using image processing and deep learning to achieve good results in both branches. In 2018, Mambou et al. [23] performed a comparative study of several algorithms for breast cancer detection using Infrared Thermal Imaging, obtaining relevant results in this field.

One of the classic methods for image thresholding is the Otsu method [24], which uses clustering to generate binary images. It is still one of the most referenced thresholding methods because of its simplicity and good performance. The method is sensitive to unbalanced class sizes: if the region of interest (ROI) or the background is much bigger than the other region, pixels of the smaller region tend to be classified incorrectly. The Otsu method finds the optimal threshold by maximizing the variance between the classes or, equivalently, minimizing the variance within each class [25]. This method has been successfully used for image thresholding in many applications [26–28].
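The between-class formulation of the Otsu method can be illustrated with a short, dependency-free sketch. It assumes 8-bit grey levels and takes a flat list of pixel values; the function name `otsu_threshold` and the synthetic pixel list are illustrative and not taken from the paper.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance.

    Pixels with value <= t form one class; the remaining pixels form the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels in the lower class
    sum0 = 0  # sum of grey levels in the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant 1/total**2 dropped).
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Synthetic "image": a dark background (100 pixels) and a bright region (40 pixels).
pixels = [10] * 50 + [12] * 50 + [200] * 20 + [210] * 20
t = otsu_threshold(pixels)
binary = [1 if p > t else 0 for p in pixels]
```

Every threshold between the two intensity groups gives the same score here, so the function returns the first one it encounters; on real mammograms the histogram is rarely this cleanly separated.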

In this paper, we explore two algorithms based on populations, the novel bat algorithm (NBA) and the genetic algorithm (GA). In addition, we explore the simulated annealing algorithm (SA) based on a single trajectory. The GA is based on an evolutionary system and the NBA is based on swarm optimization.

There are many segmentation techniques, but there is no method to determine which algorithm is the most useful for segmenting a specific kind of digital image. In this paper, we focus on meta-heuristics applied to the segmentation of digital images, specifically mammography images.

This paper is organized as follows. The review of recent works in mammography image segmentation is presented in Section 2, with an explanation of the algorithms under test, while giving insights into their functioning. In Section 3, the proposed methodology is described. Section 4 provides the results and the analysis of the experiments. Finally, in Section 5 we present the discussion and conclusions of this work and future work in this area.

#### **2. Meta-Heuristics and Applications for Image Segmentation**

Meta-heuristics are algorithms that mimic fauna behavior or biological systems to solve computational problems such as optimization, classification, or segmentation [29]. Some algorithms used to solve optimization problems are not effective when the problem is nondeterministic, since this requires many computational resources. In such cases, meta-heuristics are one of the best choices because of their stochastic behavior [30]. These algorithms use randomly generated values to conduct a local search while exploring a larger region of the solution space.

In this paper, we present the use of some meta-heuristics applied to segmentation. We explain these meta-heuristics below.

#### *2.1. Simulated Annealing*

Simulated annealing (SA) [31] is a searching algorithm mainly designed for global optimization problems [32].

The simulated annealing (SA) algorithm is an optimization method inspired by the tempering of metals used since 5000 B.C. and belongs to a class of local search algorithms (LSA) commonly called threshold algorithms (TA).

To understand the simulated annealing method, it helps to know that the underlying technique is used in industry to make materials more resistant or more crystalline, improving their qualities. The same principle is adapted to a computational algorithm that mimics this process.

The process consists of "melting" the material (heating it to a very high temperature). In this situation, the atoms gain a "random" distribution within the material structure, and the system energy is maximal. Then, the temperature is reduced in stages, allowing the atoms to remain in equilibrium in each of these stages (that is, the atoms reach an optimal configuration for that temperature). At the end of the process, the atoms form a highly regular crystalline structure, thus, the material achieves maximum strength, and the system energy is minimal.

The experiments report that if the process sharply reduces the temperature or if there is not enough time in each stage, the structure of the material is not optimal.

This algorithm presents three main phases: heating the material to a predetermined temperature, maintaining a temperature that allows the molecules to settle into states of minimum energy, and then slowly cooling the material to allow the crystals to grow and their defects to shrink. In each iteration, the algorithm tests neighboring solutions to find better ones, and it may accept a new solution with some probability even if it is not better than the current one.

Simulated annealing (SA) is a single-trajectory method that starts with a certain state, *S*. Through a particular process, it generates a neighbor state, *S'*, of the current one. If the energy, or evaluation, of the state *S'* is better than that of *S*, the current state changes from *S* to *S'*.

If the evaluation of *S'* is worse than that of *S*, *S'* may still be chosen instead of *S*, with a probability that depends on the difference between the evaluations of both states and on the current temperature *T*. Occasionally choosing a worse state allows the search to leave a local optimum in order to reach the global optimum. In a minimization process, the probability of choosing a worse state is calculated as follows, where *f*(*x*) and *f'*(*x*) denote the evaluations of *S* and *S'*, respectively:

$$p = e^{-\frac{f'(x) - f(x)}{T}} \tag{1}$$

In this work, given a state *S*, we obtain a neighboring state *S'* of the following form:

$$S'(i) = S(i) \pm \mathrm{rand}(0, 255) \tag{2}$$

The simulated annealing algorithm has several stages. Each stage corresponds to a lower temperature than the previous one (the schedule is monotonic: after each stage, the temperature goes down and the system cools). Therefore, a criterion for changing the temperature is required, that is, how long to wait at each stage for the system to reach its "thermal equilibrium".

If the temperature is lowered sufficiently slowly, and enough transitions are generated at each temperature, the algorithm can reach the optimum configuration.
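The acceptance rule of Equation (1), the neighbor move of Equation (2) and the staged cooling can be combined into a compact sketch. Several details below are assumptions made for the example rather than the configuration used in this work: only one randomly chosen component is perturbed per move, values are clipped to [0, 255], the temperature follows a geometric schedule (`alpha`), and `toy_cost` is a hypothetical objective standing in for the real fitness function.

```python
import math
import random


def neighbour(state, rng, max_step=255):
    """Eq. (2): perturb one grey level by +/- rand(0, max_step), clipped to [0, 255]."""
    i = rng.randrange(len(state))
    step = rng.choice((-1, 1)) * rng.randint(0, max_step)
    new_state = list(state)
    new_state[i] = min(255, max(0, state[i] + step))
    return new_state


def accept_probability(f_old, f_new, temperature):
    """Eq. (1), minimization: probability of accepting a worse state."""
    return math.exp(-(f_new - f_old) / temperature)


def simulated_annealing(f, state, t0=100.0, t_min=1e-3, alpha=0.95,
                        moves_per_stage=50, seed=0):
    """Minimize f starting from `state`; returns (best_state, best_value)."""
    rng = random.Random(seed)
    current, f_cur = list(state), f(state)
    best, f_best = current, f_cur
    t = t0
    while t > t_min:                       # one loop pass per temperature stage
        for _ in range(moves_per_stage):   # transitions generated at this stage
            cand = neighbour(current, rng)
            f_cand = f(cand)
            better = f_cand <= f_cur
            if better or rng.random() < accept_probability(f_cur, f_cand, t):
                current, f_cur = cand, f_cand
                if f_cur < f_best:
                    best, f_best = current, f_cur
        t *= alpha                         # assumed geometric cooling
    return best, f_best


def toy_cost(state):
    """Hypothetical objective: squared distance of each grey level from 128."""
    return sum((s - 128) ** 2 for s in state)


start = [0, 255]
best_state, best_cost = simulated_annealing(toy_cost, start)
```

A smaller `alpha` cools faster and saves evaluations, at the cost of a higher risk of freezing in a local optimum, which mirrors the trade-off described above.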

Recently, Manwar et al. [32] used an algorithm similar to simulated annealing to reduce the nonlinearity of Galvo scanners. In that study, they evaluated the algorithm in different frequencies to synthesize the signal. Their method showed better results compared to other methods for the compensation of Galvo scanners.

Similarly, in 2018, Fayyaz et al. [33] used simulated annealing (SA) to optimize an amplitude modulator, showing the efficiency and effectiveness of SA in finding the optimum wavefront shape for focusing light.

#### *2.2. Genetic Algorithms*

In natural systems, different individuals of a population compete for resources to survive and reproduce. Individuals with a better biological adaptation to their environment preserve and enhance their genetic material and pass that information to future generations. Less adapted individuals, in contrast, generate fewer descendants, so the possibility of transmitting their genetic material to the next generations is much smaller.

From an artificial systems perspective, computational thinking maps the adaptation coming from natural evolution processes into algorithms used for optimization problems. These heuristics-based algorithms try to imitate, in a certain way, the natural evolutionary process to get the best results for a specific problem. Evolutionary algorithms use a population to find solutions and identify the best one to solve an optimization problem. Among them are genetic algorithms (GAs), computational models capable of adapting or recreating themselves based on mutation, crossing and selection [34], which are metaphors for the principles of the natural evolution of species proposed by Darwin in 1859 [35]. Research on GAs began to develop in the 1950s, when some biologists tried to simulate genetic systems on a computer. One of them, A. Fraser [36], did something very similar to GAs using chains and phenotypes, which he tried to simulate on the SILLIAC computer. Years later, in 1975, John H. Holland [37] was the first to gather and develop the critical mass of ideas from systems theory, mathematics, and computational science needed to apply the principles of evolution to the search for optimal results of a particular problem. The principles developed by Holland led to the birth of GAs and their use in the theory of intelligent adaptive systems.

As stated before, GAs work with a population of individuals, each of which represents a candidate solution to the optimization problem and is evaluated by a fitness function. Each individual has a value of the function being tested; this value represents the adaptation of the individual to the environment. The better the adaptation of an individual to the problem, the more likely it is to transmit its genes to the next generations. The solutions that obtain the best values of the objective function are those that are preserved during the optimization process for future generations of individuals. Therefore, this value determines whether an individual is a good candidate and guides the search for good solutions.


With individuals chromosomally coded, the algorithm can carry out the evolution of the solutions by following the steps described in Figure 1 [38]:

**Figure 1.** Flowchart of genetic algorithm.

In summary, the genetic algorithm starts by randomly generating a population and then uses crossing (exchange of genetic material between two individuals), mutation (alteration of the internal elements of each individual), and selection operators (operators designed to select the most suitable individuals for each problem) to select the best solutions in each iteration.
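The loop summarized above can be sketched in a few lines. The operators chosen here (binary tournament selection, one-point crossing, per-gene random mutation and keeping the best individual unchanged) are common defaults picked for illustration; they are assumptions, not necessarily the operators used in this work, and `toy_fitness` is a hypothetical objective.

```python
import random


def genetic_algorithm(fitness, genes=4, pop_size=20, generations=60,
                      mutation_rate=0.1, seed=0):
    """Maximize `fitness` over vectors of grey levels in [0, 255]."""
    rng = random.Random(seed)
    # Random initial population of grey-level vectors.
    pop = [[rng.randint(0, 255) for _ in range(genes)] for _ in range(pop_size)]

    def select():
        # Binary tournament: the fitter of two random individuals is chosen.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        elite = max(pop, key=fitness)
        children = [list(elite)]            # assumed elitism: keep the best as is
        while len(children) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randint(1, genes - 1)  # one-point crossing (needs genes >= 2)
            child = p1[:cut] + p2[cut:]
            # Mutation: each gene is replaced by a random level with small probability.
            child = [rng.randint(0, 255) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)


def toy_fitness(individual):
    """Hypothetical fitness: highest when every grey level equals 100."""
    return -sum((g - 100) ** 2 for g in individual)


best = genetic_algorithm(toy_fitness)
```

Because the best individual is copied unchanged into each new generation, the best fitness in the population can never decrease from one iteration to the next.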

#### *2.3. Novel Bat Algorithm*

The bat algorithm (BA) was proposed by Yang and Hossein in 2012 [39] to solve optimization problems. Later, Meng et al. [40] modified the algorithm to include the Doppler effect, naming this version the novel bat algorithm (NBA). The algorithm models the behavior of bats and their echolocation ability to identify objects in a given space. The bats use echolocation to sense distance, and they change velocity and frequency to search for solutions to the fitness function being optimized.

One of the novel features of the NBA algorithm is the consideration and modelling of the compensation of the Doppler effect on the pulses emitted and the environmental noise. Another aspect to highlight is the inclusion of the quantum behavior of bats, which allows these virtual bats to find food in different habitats to find the optimal overall solution to the problem.

There are different variants of bat-inspired algorithms based on modelling some echolocation characteristics of bats. Therefore, the authors of the BA proposal in [41] provide a set of rules for optimization algorithms based on the behavior of bats, as described in [40]:


• … they can automatically adjust the wavelength of their emitted pulses (frequency), and adjust the pulse emission rate within [0, 1].
• Although the volume may vary in various ways in reality, it is important to consider that the noise varies from a positive number *V*<sub>0</sub> to a constant minimum value *V*<sub>min</sub>.

To this brief set of rules, we must add a couple more, which present a direct relationship with the new features proposed by the NBA algorithm [40]:


Finally, *z*<sup>*G*</sup><sub>*i*,*j*</sub> (*i* ∈ {1, . . . , *n*}, *j* ∈ {1, . . . , *d*}) characterizes the positions and *q*<sup>*G*</sup><sub>*i*,*j*</sub> the velocities, in an iteration *G*, of *n* bats looking for food in a *d*-dimensional space.

The positions, velocities and other parameters of the bats' behavior are updated over *G* iterations. Each iteration performs the steps shown in Figure 2:

**Figure 2.** Flowchart of the novel bat algorithm.

As seen, this algorithm can self-adapt throughout the optimization process, because of the compensation it makes for the Doppler effect in echoes and the adjustments it makes to the frequency according to the proximity of individuals to the solution. It is also worth mentioning the addition of habitat selection methods to BA, since these two types of behavior of bats, mechanical and quantum, allow the algorithm to have a better convergence and diversity of solutions.

This algorithm was used for the segmentation of mammography images by González-Patiño et al. in 2016 [42], showing lower segmentation errors compared to the Otsu method.

The NBA and the GA are population-based algorithms, so they perform a more extensive search throughout the search space, which gives them an advantage over SA, a single-trajectory algorithm.

#### **3. Proposed Method**

In this work, we propose a new method for the segmentation of mammographic images through meta-heuristics. This process is relevant because most meta-heuristics are designed only as optimization algorithms.

This methodology consists of three stages:

1. The mammography image is segmented using the meta-heuristic algorithms (Section 3.1).
2. Characteristics are extracted from the region of interest of the segmented image, including descriptors of the shape and location of the lesion. With these 38 descriptors, a data bank is formed for subsequent classification.
3. Once the database is formed, a classification algorithm is used to get a pre-diagnosis of the corresponding lesion.

We explain each stage in the following subsections.

#### *3.1. Image Segmentation as an Optimization Problem*

The proposal of the present investigation for phase 1 of the method consists of the modification of the meta-heuristic algorithms previously mentioned in Section 2, and their application to the segmentation of mammography images.

Characterizing the mammographic image through meta-heuristic algorithms frames the segmentation of the image as an optimization problem. For this purpose, the Dunn index is used as the objective function in all cases.

We use the Dunn index [43] as the fitness function for the meta-heuristic algorithms, so that the higher the Dunn index, the better the segmentation. Thus, in the three meta-heuristic algorithms, this index is the objective function to be maximized, and it is defined in Equation (3). We calculate this index as the minimum inter-cluster distance over the maximum intra-cluster distance:

$$f = \frac{\text{Minimum distance between pixels of a different group}}{\text{Maximum distance between pixels of the same group}} \tag{3}$$

The objective function relates the dispersion between pixels of different groups to the dispersion between pixels of the same group. We seek to maximize it, since we prefer the numerator to be large and the denominator to be small.

The distance calculated in Equation (3) is the Minkowski distance of order *p* = 1. For order *p*, this distance is defined in Equation (4), with *p* ≥ 1.

$$\left(\sum\_{i=1}^{n} |x\_{i} - y\_{i}|^{p}\right)^{\frac{1}{p}}\tag{4}$$

The Minkowski distance of order 1 reduces to the absolute difference between two pixel values, each between 0 and 255.
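As a small illustration of Equation (4), the following sketch computes the Minkowski distance between two equal-length vectors. The function name `minkowski` and the use of plain Python sequences are assumptions made for this example; grey levels are passed as one-element sequences.

```python
def minkowski(x, y, p=1):
    """Minkowski distance of order p (p >= 1) between two sequences."""
    if p < 1:
        raise ValueError("p must be >= 1")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# For two grey levels and p = 1 the distance is their absolute difference.
print(minkowski([200], [55]))           # order 1 on scalar grey values
print(minkowski([0, 0], [3, 4], p=2))   # order 2 is the Euclidean distance
```

With *p* = 1 and scalar grey values, the call is equivalent to `abs(a - b)`, which is the simplification used in the text.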

Three-dimensional vectors are used to code the solutions; the components are defined in the interval [0, 255], and each represents a grey level. Three components are defined because the image is to be segmented into three regions: background, breast area, and lesion. An example of this coding is shown in Figure 3, which represents an individual with three components, each of them representing a grey level.

This coding allows us to consider each solution as a candidate segmentation. Considering the solution of Figure 3, where each component represents a region of the image, we can obtain the segmentation shown in the same figure for Individual 1. For each pixel in the image, we calculate the distance to each component value and assign the pixel to the component with the lowest distance.

**Figure 3.** Example of segmentation using the solution of Individual 1.
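The coding and the fitness of Equation (3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the flat list of grey values, the tie-breaking rule (first component wins) and the handling of degenerate cases (fewer than two non-empty groups, or groups with a single grey value) are all assumptions.

```python
from itertools import combinations


def assign_regions(pixels, levels):
    """Label each grey value with the index of its nearest component."""
    return [min(range(len(levels)), key=lambda k: abs(v - levels[k]))
            for v in pixels]


def dunn_fitness(pixels, levels):
    """Equation (3): min distance between groups / max distance within a group."""
    pixels = list(pixels)
    labels = assign_regions(pixels, levels)
    groups = {}
    for value, label in zip(pixels, labels):
        groups.setdefault(label, set()).add(value)  # distinct grey values only
    if len(groups) < 2:
        return 0.0  # assumption: a single-region result gets the worst fitness
    max_intra = max(max(g) - min(g) for g in groups.values())
    min_inter = min(abs(a - b)
                    for ga, gb in combinations(groups.values(), 2)
                    for a in ga for b in gb)
    if max_intra == 0:
        return float("inf")  # assumption: perfectly compact groups
    return min_inter / max_intra


pixels = [0, 10, 100, 110, 250, 255]    # a flattened toy image
individual = (5, 105, 250)              # three grey-level components
print(assign_regions(pixels, individual))
print(dunn_fitness(pixels, individual))
```

A two-dimensional image can be flattened before the call, for example with `[v for row in image for v in row]`. Because grey images contain at most 256 distinct values, keeping only the distinct values per group keeps the pairwise comparison small.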

*Appl. Sci.* **2019**, *9*, 4492


According to Zhang et al. [44], the evaluation of the segmentation uses the average squared colour error (*F*, Equation (5)), which penalizes over-segmentation, and its improved version (*F'*, Equation (6)), which penalizes segmentations with a much greater number of small regions.

$$F = \sqrt{N} \sum\_{j=1}^{N} \frac{e\_{j}^{2}}{\sqrt{S\_{j}}} \tag{5}$$

$$F' = \frac{1}{1000 \cdot S\_{I}} \sqrt{\sum\_{b=1}^{\text{MaxArea}} \left[N(b)\right]^{1 + \frac{1}{b}}} \sum\_{j=1}^{N} \frac{e\_{j}^{2}}{\sqrt{S\_{j}}} \tag{6}$$

$$e\_{j}^{2} = \sum\_{p \in R\_{j}} \left(C\_{x}(p) - \hat{C}\_{x}(R\_{j})\right)^{2} \tag{7}$$

$$\hat{C}\_{x}(R\_{j}) = \frac{\sum\_{p \in R\_{j}} C\_{x}(p)}{S\_{j}} \tag{8}$$

where *N* represents the number of regions in the image, *S<sub>j</sub>* is the number of pixels in region *j*, *S<sub>I</sub>* is the area of the image, *N*(*b*) is the number of regions of the segmented image that have exactly *b* units of area, and *e<sub>j</sub>*<sup>2</sup> is the squared colour error of region *j*, defined in Equation (7). *C<sub>x</sub>*(*p*) is the value of component *x* for pixel *p*, and *Ĉ<sub>x</sub>*(*R<sub>j</sub>*), defined in Equation (8), is the average value of component *x* in region *R<sub>j</sub>*.
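Equations (5) to (8) can be evaluated with a few lines of code. The sketch below assumes a single grey-level component and treats every label as one region (no connected-component analysis); both simplifications, and all function names, are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def region_stats(pixels, labels):
    """Return (squared colour error, size) per region, Equations (7) and (8)."""
    regions = defaultdict(list)
    for value, label in zip(pixels, labels):
        regions[label].append(value)
    stats = []
    for values in regions.values():
        mean = sum(values) / len(values)                 # Equation (8)
        error = sum((v - mean) ** 2 for v in values)     # Equation (7)
        stats.append((error, len(values)))
    return stats


def f_error(pixels, labels):
    """Equation (5)."""
    stats = region_stats(pixels, labels)
    return sqrt(len(stats)) * sum(e / sqrt(s) for e, s in stats)


def f_prime_error(pixels, labels):
    """Equation (6); areas with N(b) = 0 contribute nothing to the sum."""
    stats = region_stats(pixels, labels)
    regions_per_area = Counter(size for _, size in stats)   # b -> N(b)
    penalty = sqrt(sum(n ** (1 + 1 / b) for b, n in regions_per_area.items()))
    weighted_error = sum(e / sqrt(s) for e, s in stats)
    return penalty * weighted_error / (1000 * len(pixels))


pixels = [0, 2, 10, 10]
labels = [0, 0, 1, 1]
print(f_error(pixels, labels), f_prime_error(pixels, labels))
```

Lower values indicate a better segmentation under both measures, which is how the mean errors are compared across algorithms.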

Figure 4 shows an example of the use of the Dunn index to identify the best segmentation. We observe three individuals with three components, where each component represents a grey level used to generate the segmented image. The fitness shown in Figure 4 is the value of the Dunn index, which measures how good the segmentation is according to the distance between clusters.

**Figure 4.** Example of individual values and calculation of fitness function.

In this paper we compare four segmentation algorithms, the Otsu method [24] and three customized meta-heuristic algorithms [31,40,45], to perform the segmentation task.

#### *3.2. Extraction of Characteristics*

For the diagnosis stage, a group of radiologists defined the characteristics. They are extracted from the region of interest of the segmented image; Moura [46] presents each definition and explains the formulas for each calculation. Some of the descriptors are presented in Table 1.


**Table 1.** Characteristics extracted from the images.

#### *3.3. Classification of the Lesion*

For this step, a new artificial immune system (AIS) is proposed, which presents a competitive performance for breast cancer classification.

The proposed algorithm uses two different responses (adaptive immune response and innate immune response), which can be seen as analogous to the training phase and the classification phase in pattern recognition. The adaptive immune response is based on grouping the patterns of each class into random groups without replacement, while the innate immune response uses the antibodies generated in the adaptive response to classify each pattern according to its most similar antibody.

Next, a mean pattern (antigen) is calculated from each of the groups formed above, and the class of the grouped patterns is assigned to it. The performance on the aforementioned set of patterns is then calculated, using the antibodies as classification models.

The algorithm performs the adjustment of the antibodies by making the antibodies approach their most similar pattern and moving them away from different class patterns.

Subsequently, it generates clones for each antibody, and it controls this increase, obtaining an average of the clones generated by each antibody.

In the adaptive immune response, the algorithm will generate artificial antibodies that will be used as structures that will remember recognized patterns (antigens) and will have the ability to identify new antigens.

Once the adaptive immune response (training phase) is complete, the final antibodies are used to classify the other patterns (innate immune response). Classification is performed by finding the closest antibody to each pattern, so each pattern is assigned the class of its closest antibody.

As a summary:

The adaptive immune response consists of five phases:


5. Resolution of threat: Antibodies are stored in the immune memory to be used in the innate immune response.

The innate immune response consists of one phase:

1. Resolution of threat: Each pattern is compared with the antibodies stored in immune memory, and the class of the antibody with the greatest similarity is assigned to it.
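The two responses can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the proposed AIS: it only builds the initial antibodies as mean patterns of random groups drawn without replacement within each class, and omits the antibody adjustment and cloning steps; the Euclidean distance as similarity measure, the `group_size` parameter and all names are hypothetical.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def initial_antibodies(patterns, classes, group_size, rng):
    """Adaptive response (simplified): one mean pattern per random group."""
    antibodies = []
    for c in sorted(set(classes)):
        members = [p for p, k in zip(patterns, classes) if k == c]
        rng.shuffle(members)  # random groups without replacement
        for start in range(0, len(members), group_size):
            group = members[start:start + group_size]
            mean = [sum(column) / len(group) for column in zip(*group)]
            antibodies.append((mean, c))
    return antibodies


def classify(pattern, antibodies):
    """Innate response: return the class of the closest antibody."""
    _, label = min(antibodies, key=lambda ab: dist(pattern, ab[0]))
    return label


patterns = [[0, 0], [0, 2], [10, 10], [10, 12]]
classes = [0, 0, 1, 1]
memory = initial_antibodies(patterns, classes, group_size=2, rng=random.Random(0))
print(classify([1, 1], memory), classify([9, 9], memory))
```

Passing a seeded `random.Random` instance keeps the grouping reproducible, which helps when the classification is repeated several times and averaged.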

The results obtained on mammographic images are presented in the following section.

#### **4. Experimental Results**

Experimental tests were performed by segmenting images with three proposed segmentation algorithms. These results were compared to the images produced by the Otsu method.

Subsequently, the errors obtained by each algorithm and the best algorithm to segment are presented.

In the same way, the classification of six datasets is carried out using three algorithms (the presented model and two classic classification algorithms).

Performances are calculated and the algorithm with the best performance for each dataset is obtained.

#### *4.1. Selection of the Best Segmentation Algorithm*

To select the best segmentation algorithm, three meta-heuristic algorithms and the Otsu method were used to analyze mammographies from the Breast Cancer Digital Repository [46], a Portuguese breast cancer database of real patients that comprises 362 segmentations. The Faculty of Medicine of the University of Porto, Portugal, provided this database, which was prepared by expert radiologists; the classifications provided (benign or malignant) were confirmed by biopsy.

The Otsu method was run with its default parameters, as implemented in Matlab™. The parameters for the NBA were three bats, three regions and 10 iterations; the remaining parameters were set as proposed by Meng et al. [40]. These parameters showed a lower segmentation error according to the study of González-Patiño et al. in 2016 [42]. For the genetic algorithm, 50 individuals were used, with three regions, 10 iterations, a mutation rate of 0.05 and a crossover rate of 0.7. Finally, for simulated annealing, an initial temperature of 100,000 and a cooling rate of 0.05 were used.
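To make the genetic algorithm settings concrete, the following sketch runs a search with the stated population size, number of iterations, crossover rate and mutation rate. Only those four values come from the text; binary tournament selection, one-point crossover, per-gene random-reset mutation, keeping the best individual found so far, and every name in the code are assumptions for illustration. The `fitness` argument is any function to be maximized, such as a Dunn-index fitness.

```python
import random


def genetic_search(fitness, pop_size=50, n_genes=3, generations=10,
                   crossover_rate=0.7, mutation_rate=0.05, seed=None):
    """Maximize `fitness` over vectors of n_genes grey levels in [0, 255]."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 255) for _ in range(n_genes)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def tournament(population):
        a, b = rng.sample(population, 2)   # assumed: binary tournament
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            c1, c2 = p1[:], p2[:]
            if rng.random() < crossover_rate:      # one-point crossover
                cut = rng.randint(1, n_genes - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):
                for i in range(n_genes):           # random-reset mutation
                    if rng.random() < mutation_rate:
                        child[i] = rng.randint(0, 255)
                children.append(child)
        pop = children[:pop_size]
        best = max(pop + [best], key=fitness)      # never lose the best so far
    return list(best)


# Toy fitness: closeness to a known set of grey levels (illustration only).
target = (10, 120, 240)
toy_fitness = lambda g: -sum(abs(a - b) for a, b in zip(g, target))
print(genetic_search(toy_fitness, seed=1))
```

With 50 individuals and 10 iterations the search evaluates far more candidate solutions than a three-bat population over the same number of iterations, which is consistent with the longer run time reported for the GA.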

Errors for each image were calculated and we present the mean of the results in Table 2. The following images show two of the experiments carried out using images from mammographies. The first row shows the original mammography image, the region of interest and the region of interest overlapped in the original image. The second row shows the results applying the Otsu method, and the binarized segmentation obtained by the novel bat algorithm and the genetic algorithm. The final row shows the segmented image produced by the simulated annealing algorithm applied to segmentation.


**Table 2.** Mean of the errors calculated for the segmentation of the 362 mammographies.

Figure 5 shows that Otsu and GA produced images of similar shape, while the NBA segmented a different, smaller region, which seems to be more accurate in relation to the original segmented image. The four algorithms obtained similar regions because this mammography has a very well-defined region of interest; however, this is not the case in all images.

**Figure 5.** (**a**) Original image, (**b**) Region of interest (ROI) segmented by an expert radiologist, (**c**) Overlap of the ROI over the original image, (**d**) Image segmented by the Otsu method, (**e**) Image segmented by the novel bat algorithm (NBA), (**f**) Image segmented by the genetic algorithm (GA), (**g**) Image segmented by simulated annealing (SA).

**Figure 6.** (**a**) Original image, (**b**) ROI segmented by an expert radiologist, (**c**) Overlap of the ROI over the original image, (**d**) Image segmented by the Otsu method, (**e**) Image segmented by NBA, (**f**) Image segmented by GA, (**g**) Image segmented by SA.

Figures 5 and 6 reveal that the images segmented by GA have a smaller area than the images segmented by the Otsu method or the NBA. This is relevant because the region obtained by GA is more similar to the original region obtained by the expert radiologist. It must be taken into account that the images produced by some algorithms include noisy pixels that do not correspond to the desired region, which is why the results show that GA obtained a better segmentation when all the corresponding pixels are considered. In both cases, the algorithm that produced the region most similar to the expert's segmentation was the genetic algorithm. Table 2 and Figure 7 show the mean of the errors for all the 362 mammography images of the dataset.

**Figure 7.** *F* and *F'* mean error values of each algorithm.

Table 2 lists the algorithms in descending order of error: Otsu has the highest error, while NBA and GA have lower errors, with GA being the algorithm with the lowest error.

The behavior observed in the meta-heuristics results from the simplicity of the algorithms, given that GA is a more complex algorithm than simulated annealing and the NBA.

Even though the Otsu method is frequently used for image thresholding, its error is 1.42 times higher than that of the NBA and 1.54 times higher than that of the GA. The error obtained by the simulated annealing algorithm was close to, but lower than, that of the Otsu method. In addition, the NBA's error is 1.08 times higher than the GA's, showing that the GA has the lowest mean error for the segmentation of breast cancer mammographies. The GA showed a lower error because of its capability to explore a higher number of candidate solutions, which resulted in a better segmentation compared to the NBA, which also explores a large number of candidate solutions.

Run times for each algorithm were calculated, and the mean run time is shown in Table 3.

**Table 3.** Mean time for each algorithm.

| Algorithm | Mean Time (sec) |
|---|---|
| Otsu method | 0.051 |
| Simulated annealing | 130.406 |
| Novel bat algorithm | 53.830 |
| Genetic algorithm | 170.830 |

According to the mean times, the Otsu method had the lowest mean run time, in contrast to its higher error in the segmentation. The genetic algorithm had the lowest segmentation error, but it had the highest mean run time.

#### *4.2. Classification of the Lesions*

The proposed classification algorithm was tested with six cancer-related datasets. The classification performances are presented below, including the comparison with two classic algorithms utilized in experiments and published in the literature.

The datasets used were:

1. Breast Cancer Digital Repository (BCDR) [46]. The Faculty of Medicine of the University of Porto, Portugal provided this dataset, and it is used to explore the computer-based detection and diagnosis methods.
2. Breast Cancer Wisconsin (Original) Data Set (BCWO) [47,48]. The University of Wisconsin, USA provided this dataset. This dataset contains data collected from clinical cases by Dr. William Wolberg.
3. Breast Cancer Wisconsin (Prognostic) Data Set (BCWP) [48]. Dr. Wolberg provided this dataset, and it contains the follow-up of breast cancer cases.
4. Lung Cancer Data Set (LCDS) [48]. Hong and Young used this dataset to apply the k-nearest


Similarly, a brief description of the classical algorithms used for classification is included. Support Vector Machines (SVM) [50] are algorithms used for classification and regression. These algorithms build a model that represents the patterns of the training set. Their main objective is to find a hyperplane that separates two classes. These algorithms work ideally for two classes; however, strategies exist that allow them to build a model able to separate more than two classes.

Repeated incremental pruning to produce error reduction (RIPPER) [51] is an algorithm based on association rules with reduced error pruning; this technique is frequently used in decision tree algorithms. The generation of rules is performed by applying pruning operators to reduce the error. The algorithm ends when the error is increased after a pruning operation.

Regarding the classification process, we present the results of classifier accuracy comparing the two algorithms with the artificial immune system in Table 4.


**Table 4.** Classification performances.

We repeated the classification process 10 times and averaged the performance of each algorithm. The classification accuracy is calculated as the number of correctly classified elements divided by the total number of elements. The best accuracy for each dataset is shown in bold in Table 4.

As can be observed, the proposed artificial immune system (AIS) achieved the best performance in four out of six datasets, which is notable considering that the other algorithms are widely used in the literature.
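The accuracy measure and the averaging over repeated runs can be written as a short sketch. The function names and the representation of a run as a pair of label lists are assumptions made for this example.

```python
from statistics import mean


def accuracy(y_true, y_pred):
    """Correctly classified elements divided by the total number of elements."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_accuracy(runs):
    """Average accuracy over repeated runs; each run is (y_true, y_pred)."""
    return mean(accuracy(t, p) for t, p in runs)


truth = ["benign", "malignant", "malignant", "benign"]
run_1 = ["benign", "malignant", "benign", "benign"]   # one error
run_2 = list(truth)                                    # no errors
print(mean_accuracy([(truth, run_1), (truth, run_2)]))
```

In the experiments described here, `runs` would hold the 10 repetitions for one algorithm on one dataset.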

#### **5. Discussion and Conclusions**

The meta-heuristic-based segmentations showed a lower error compared to the Otsu method, which is relevant since the Otsu method is one of the classic and still used methods for thresholding. These algorithms showed better performances than the Otsu method, even when using them without testing different configurations.

When contrasting the meta-heuristic with the highest error (SA) and the Otsu method, it can be observed that the error presented in simulated annealing is lower than that of Otsu, which represents an important fact since Otsu is a well-known and useful method for thresholding.

According to the run-time test, the Otsu method had the lowest run time. However, this method was run using the Matlab™ implementation, which is an optimized and reviewed version of the algorithm, whereas the meta-heuristic algorithms were implemented naively, with no use of parallel programming.

Many meta-heuristics have numerous parameters, and the performance of the algorithm depends on the value chosen for each of them. In this work, we did not tune the parameters of these algorithms or test other configurations; this could be improved in future work.

The classification model presented (AIS) showed good performance on most of the datasets on which it was tested. This is relevant since it is a new intelligent computing algorithm that opens a line of research in bio-inspired algorithms.

The proposed method can be used as a guide for the radiologists when there is a high demand for processing mammograms and it can also be used as a second opinion in critical cases.

In future work, we can implement preprocessing to enhance the contrast of the images and compare segmentation with and without this preprocessing step. This could help the meta-heuristic algorithms to produce a better segmentation.

Concerning the classification algorithm, it would be worthwhile to explore the use of mutation strategies to approach the global optimum and improve the search for the desired solution.

**Author Contributions:** Conceptualization, A.-J.A.-C., D.G.-P., Y.V.-R., and F.K.; Methodology, A.-J.A.-C., D.G.-P., Y.V.-R., and F.K.; Software, Y.V.-R., D.G.-P.; Validation, Y.V.-R., D.G.-P.; Formal Analysis, A.-J.A.-C., D.G.-P., Y.V.-R.; Investigation, A.-J.A.-C., D.G.-P., Y.V.-R., and F.K.; Resources, ; Data Curation, A.-J.A.-C. and Y.V.-R.; Writing—Original Draft Preparation, D.G.-P., and Y.V.-R.; Writing—Review and Editing, A.-J.A.-C., D.G.-P., and Y.V.-R.; Visualisation, A.-J.A.-C. and Y.V.-R.; Supervision, A.-J.A.-C.; Project Administration, A.-J.A.-C. and D.G.-P.

**Funding:** The authors of the present paper would like to thank the following institutions for their economic support to develop this work: Comisión de Operación y Fomento de Actividades Académicas del Instituto Politécnico Nacional (COFAA-IPN), Centro de Investigación en Computación (Centro de Investigación en Computación-IPN), Centro de Innovación y Desarrollo Tecnológico en Cómputo (CIDETEC), Escuela Superior de Ingeniería Mecánica y Eléctrica (ESIME), and Consejo Nacional de Ciencia y Tecnología (CONACYT).

**Acknowledgments:** The authors would like to thank the Instituto Politécnico Nacional (Secretaría Académica, Comisión de Operación y Fomento de Actividades Académicas, Secretaría de Investigación y Posgrado, Centro de Investigación en Computación, and Centro de Innovación y Desarrollo Tecnológico en Cómputo), the Consejo Nacional de Ciencia y Tecnología (Conacyt), and Sistema Nacional de Investigadores in México for their economic support to develop this work. In addition, the authors gratefully acknowledge the Electrical and Computer Engineering department of the University of Waterloo, Ontario in Canada for their support to develop this work.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Article*

## **Quantitative Analysis of Benign and Malignant Tumors in Histopathology: Predicting Prostate Cancer Grading Using SVM**

**Subrata Bhattacharjee <sup>1</sup> , Hyeon-Gyun Park <sup>1</sup> , Cho-Hee Kim <sup>2</sup> , Deekshitha Prakash <sup>1</sup> , Nuwan Madusanka <sup>1</sup> , Jae-Hong So <sup>2</sup> , Nam-Hoon Cho <sup>3</sup> and Heung-Kook Choi 1,\***


Received: 12 June 2019; Accepted: 22 July 2019; Published: 24 July 2019

**Abstract:** An adenocarcinoma is a type of malignant cancerous tissue that forms from a glandular structure in epithelial tissue. Analyzed stained microscopic biopsy images were used to perform image manipulation and extract significant features for support vector machine (SVM) classification, to predict the Gleason grading of prostate cancer (PCa) based on the morphological features of the cell nucleus and lumen. Histopathology biopsy tissue images were used and categorized into four Gleason grade groups, namely Grade 3, Grade 4, Grade 5, and benign. The first three grades are considered malignant. K-means and watershed algorithms were used for color-based segmentation and separation of overlapping cell nuclei, respectively. In total, 400 images, divided equally among the four groups, were collected for SVM classification. To classify the proposed morphological features, SVM classification based on binary learning was performed using linear and Gaussian classifiers. The prediction model yielded an accuracy of 88.7% for malignant vs. benign, 85.0% for Grade 3 vs. Grade 4, 5, and 92.5% for Grade 4 vs. Grade 5. The SVM, based on biopsy-derived image features, consistently and accurately classified the Gleason grading of prostate cancer. All results are comparatively better than those reported in the literature.

**Keywords:** prostate cancer; histopathology; microscopic; tissue image; segmentation; morphological; quantitative; classification; SVM

### **1. Introduction**

Prostate adenocarcinoma, a type of prostate cancer, is the second most commonly diagnosed cancer. In the United States, the incidence of prostate cancer ranks first among all malignant tumors in men. The Gleason score is currently the most common grading system of prostate adenocarcinoma and is widely used to assess the prognosis of men with prostate cancer using samples from a prostate biopsy. There are some diagnostic protocols for cancer grading, for which microscopic evaluation of tissue specimens is required. For this, the samples need to be appropriately stained using Hematoxylin and Eosin (H&E) compounds. The cancer grade is assessed by a pathologist based on the morphological features of lumen and cell nucleus observed in the tissue. Cancer diagnosis and grading based on digital pathology have become increasingly complex due to the increase in cancer occurrence and specific treatment options for patients [1].

In South Korea, the incidence of prostate cancer is increasing significantly. Prostate cancer (PCa) is the fifth most common cancer among males in Korea and the expected cancer deaths in 2018 were 82,155 [2]. The detection of prostate cancer has always been a major issue for pathologists and medical practitioners, for both diagnosis and treatment. Usually, the cancer detection process in histopathology consists of categorizing stained microscopic biopsy images into malignant and benign.

The Gleason grade grouping system defines Gleason scores ≤ 6 as grade 1, score 3 + 4 = 7 as grade 2, score 4 + 3 = 7 as grade 3, score 4 + 4, 3 + 5 or 5 + 3 = 8 as grade 4, and score 4 + 5, 5 + 4 or 5 + 5 = 9 or 10 as grade 5. The Gleason score is obtained by adding the primary (most common) and secondary (second most common) scores from H&E stained tissue microscopic images. This system was developed by Dr. Donald F Gleason, who was a Pathologist in Minnesota, and members of the Veterans Administration Cooperative Urological Research Group (VACURG) [3]. This system was tested on a large number of patients, including long-term follow-ups and is considered an outstanding success.
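The grouping rule described above can be expressed as a small lookup function. This is a sketch of the mapping exactly as stated in the text; the function name and the decision to reject pattern pairs the text does not list (such as 2 + 5) are assumptions.

```python
def gleason_grade_group(primary, secondary):
    """Map primary and secondary Gleason patterns (1-5) to a grade group (1-5)."""
    if not (1 <= primary <= 5 and 1 <= secondary <= 5):
        raise ValueError("Gleason patterns must be between 1 and 5")
    score = primary + secondary
    if score <= 6:
        return 1
    if (primary, secondary) == (3, 4):
        return 2
    if (primary, secondary) == (4, 3):
        return 3
    if score == 8:
        return 4   # 4 + 4, 3 + 5 or 5 + 3
    if score >= 9:
        return 5   # 4 + 5, 5 + 4 or 5 + 5
    # Score 7 from 2 + 5 or 5 + 2 is not covered by the rule in the text.
    raise ValueError("pattern pair not covered by the grouping rule")


for pair in [(3, 3), (3, 4), (4, 3), (5, 3), (4, 5)]:
    print(pair, gleason_grade_group(*pair))
```

The order of the two patterns matters only for a score of 7, where a primary pattern of 4 places the case in a higher grade group than a primary pattern of 3.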

In recent years, an excellent and important addition to microscopy and digital imaging has been developed for microscopes that are used to convert stained tissue slides into whole slide digital images. This allows for more efficient computer-based viewing and analysis of histopathology. Early diagnosis and treatment are required, to avoid the enlargement of cancer cells in the prostate gland and control the spreading of more aggressive tumors to other parts of the body.

The digital pathology field has grown dramatically over recent years, largely due to technological advancements in image processing and machine learning algorithms, and increases in computational power. As part of this field, many methods have been proposed for automatic histopathological image analysis and classification. In this paper, color segmentation, based on k-means clustering method, is proposed for microscopic biopsy tissue image processing, and the watershed algorithm has been implemented to separate touching cell nuclei in tissue images.

The watershed algorithm can be implemented in different ways; in this study, a marker-selection approach was adopted to control over-segmentation. Diagnosing prostate cancer from a biopsy tissue image under a microscope is difficult for pathologists and doctors. Therefore, machine learning and deep learning techniques have been developed for computerized classification and cancer grading. In this study, a machine learning classification method is proposed to classify Gleason grade groups of prostate cancer. From a computer engineering perspective, because the regular procedure for diagnosing and grading prostate cancer is difficult and time-consuming, automated computerized methods are in high demand and are essential for medical image analysis.

#### **2. Literature Review**

Tabesh et al. [4] extracted features that describe color, texture, and morphology from 367 and 268 H&E image patches, which were acquired from tissue microarray (TMA) datasets. These features were used for support vector machine (SVM) classification. They achieved an accuracy of 96.7% and 81% for predicting benign vs. malignant and low-grade vs. high-grade classifications, respectively, using 5-fold cross-validation.

Doyle et al. [5] proposed a cascade approach to the multi-class grading problem. They used cascade binary classification to maximize inter- and intra-class accuracy rather than the conventional one-shot classification and one-versus-all approaches to multi-class classification. In the proposed cascade approach, each division is classified separately and independently.

Nir et al. [6] proposed some novel features based on intra- and inter-nuclei properties for classification. They trained their classifier on 333 tissue microarray (TMA) cores annotated by six pathologists for different Gleason grades and used SVM classification to achieve an accuracy of 88.5% and 73.8% for cancer detection (benign vs. malignant) and low vs. high grade (Grade 3 vs. Grade 4, 5), respectively.

Doyle et al. [7] extracted nearly 600 image texture features to perform pixel-wise Bayesian classification at each image scale to obtain the corresponding likelihood scene. The authors achieved an accuracy of 88.0% for distinguishing between benign and malignant samples.

Rundo et al. [8] proposed a Fuzzy C-Means (FCM) clustering algorithm for prostate multispectral MRI morphologic data processing and segmentation. The authors used co-registered T1w and T2w MR image series and achieved an average Dice similarity coefficient of 90.77 ± 7.75, compared with 81.90 ± 6.49 and 82.55 ± 4.93 when processing T2w and T1w imaging alone, respectively.

Jiao et al. [9] combined deep learning and SVM methods for breast mass classification. The methods were applied to the Digital Database for Screening Mammography (DDSM) dataset and achieved high accuracy under two objective evaluation measures. The authors used nearly 600 images, of which 50% were benign and 50% were malignant. The classification accuracy achieved in this paper was 96.7% for distinguishing between benign and malignant samples.

Hu et al. [10] presented a novel mass detection system for digital mammograms, which integrated a visual saliency model with deep learning techniques. The authors used combined deep learning and SVM methods for image and feature classification, respectively. They achieved an average accuracy of 91.5% in mass detection between cancer and benign datasets.

Naik et al. [11] presented a method for automated analysis of histopathology images. They demonstrated the utility of a glandular and nuclear segmentation algorithm for the accurate extraction of various morphological and nuclear features for automated grading of prostate cancer and breast cancer, and for distinguishing between cancerous and benign breast histology specimens. The authors used an SVM classifier for classification of prostate images containing 16 Gleason grade 3 images, 11 grade 4 images, and 17 benign epithelial images of biopsy tissue. They achieved an accuracy of 95.19% for grade 3 vs. grade 4, 86.35% for grade 3 vs. benign, and 92.90% for grade 4 vs. benign.

Nguyen et al. [12] introduced a novel approach to grade prostate malignancy using digitized histopathological specimens of the prostate tissue. They have extracted tissue structural features from the gland morphology and co-occurrence texture features from 82 regions of interest (ROI) with 620 × 550 pixels to classify a tissue pattern into three major categories: benign, grade 3 carcinoma, and grade 4 carcinoma. The authors proposed a hierarchical (binary) classification scheme and obtained 85.6% accuracy in classifying an input tissue pattern into one of the three classes.

Albashish et al. [13] proposed some texture features, namely Haralick, Histogram of Oriented Gradient (HOG), and run-length matrix, which have been extracted from nuclei and lumen images individually. They used a total of 149 images with 4140 × 3096 pixels, and the dataset was randomly divided into 50% for training and 50% for testing. An ensemble machine learning classification system was proposed, and achieved an accuracy of 88.9% for Grade 3 vs. Grade 4, 92.4% for benign vs. Grade 4, and 97.85% for benign vs. Grade 3. These accuracies were averaged over 50 simulation runs and statistical significance.

Diamond et al. [14] used morphological and texture features to classify the sub-region of 100 × 100 pixels and subjected each to image-processing techniques. They classified a tissue image into either stroma or prostatic carcinoma. In addition, the authors used lumen area to discriminate benign tissue from the other two classes. As a result, 79.3% of sub-regions were correctly classified.

Ding et al. [15] introduced an automated image analysis framework capable of efficiently segmenting microglial cells from histology images and analyzing their morphology. Their experiments show that the proposed framework is accurate and scalable for large datasets. They extracted three types of features for SVM classification, namely Mono-fractal, Multi-fractal, and Gabor features.

Yang et al. [16] used image processing and machine learning algorithms to analyze the smear images captured by the developed image-based cytometer. A low-cost, portable image-based cytometer was built for image acquisition from Giemsa-stained blood smears. The authors manually selected 50 images for the training set, of which 25 were parasite images and 25 were non-parasite images. The selected images were then segmented separately to extract the features for Support Vector Machine (SVM) classification, and a linear-kernel classifier was used to train and test on these features.

#### **3. Materials and Methods**

#### *3.1. Tissue Image Dataset*

The histopathology images that were gathered to create our dataset are sub-images of benign and malignant samples. These sub-images were cropped from the whole-slide microscopic tissue images stained with H&E, shown in Figure 1. The data were collected from Severance Hospital of Yonsei University and the grading of these data was histologically confirmed by a pathologist. The whole slide size in Figure 1a–d is 33,584 × 70,352 pixels. The patch image magnification is 40× for Figure 1e–h and the image size is 512 × 512 pixels. We selected 400 sub-images for feature extraction and SVM classification. These were divided into four groups, namely Grade 3, Grade 4, Grade 5, and Benign.

*Appl. Sci.* **2019**, *9*, x FOR PEER REVIEW 4 of 16

**Figure 1.** Microscopic biopsy images stained with Hematoxylin and Eosin (H&E) compound; (**a**–**d**) whole slide tissue images of Grade 3, Grade 4, Grade 5, and Benign; and (**e**–**h**) the regions of interest (ROIs) taken from whole-slide images (**a**–**d**), respectively. The dark blue is the cell nucleus, pink is the stroma, and white is the lumen.

Figure 1 shows the sub-images that were used to detect cell nuclei and classify prostate cancer. It is a very challenging task to classify different Gleason grades because images usually contain many clusters and overlapping objects. Figure 2 shows the entire proposed process for predicting cancer grades based on microscopic images. The pipeline model includes the original biopsy image, region of interest (ROI) segmentation, watershed segmentation, feature extraction, classification, and analysis of results [16].

**Figure 2.** Proposed pipeline model for predicting cancer grading from microscopic biopsy images.

#### *3.2. ROI Segmentation*

Image segmentation plays an important role in medical image processing systems. The nuclei and lumen of prostate cancer are the most important components of histopathological images [17]. To identify cell nuclei and lumen from images and carry out systematic processing, a K-means clustering algorithm was applied using MATLAB R2018a (The MathWorks, Natick, MA, USA) [18], where image pixels were partitioned into three clusters (thus, k = 3). The segmented components from the tissue images are: stroma, lumen, and the cell nucleus. However, nucleus and lumen components were selected for feature extraction and SVM classification, as shown in Figure 3 [19].

**Figure 3.** Image segmentation using K-means algorithm: (**a**) original tissue image; (**b**) lumen segmentation; and (**c**) nucleus segmentation.

According to our visual results, the K-means based method is best suited for microscopic biopsy images. K-means segmentation has been applied here to separate the nucleus and lumen tissue components from microscopic biopsy images. The K-means algorithm refines its result iteratively, alternating between the following two steps:

1. Data assignment step: each pixel $\mathbf{x}$ is assigned to its nearest centroid in the set of centroids $C$:

$$\underset{c_k \in C}{\text{argmin}} \; dist(c_k, \mathbf{x})^2 \tag{1}$$

2. Centroid update step: each centroid $c_k$ is recomputed as the mean of the set $s_k$ of pixels assigned to it:

$$c_k = \frac{1}{|s_k|} \sum_{\mathbf{x}_i \in s_k} \mathbf{x}_i \tag{2}$$
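The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the MATLAB implementation used in the study; the toy pixel values and the initial centroids (rough guesses for nucleus, stroma, and lumen colours) are assumptions made only for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two colour tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Alternate the assignment step (Eq. 1) and the update step (Eq. 2)."""
    centroids = [tuple(map(float, c)) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every pixel.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(centroids[k], p))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned pixels.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical RGB pixels: two bluish (nucleus), two pink (stroma), two white (lumen).
pixels = [(60, 40, 130), (70, 50, 140), (225, 150, 190),
          (235, 160, 200), (245, 245, 245), (255, 250, 250)]
initial = [(0, 0, 128), (255, 128, 192), (255, 255, 255)]  # assumed starting guesses
centroids, labels = kmeans(pixels, initial)
```

Passing the initial centroids explicitly keeps the sketch deterministic; a random initialization can converge to a different partition.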

The K-means algorithm is composed of the following steps:


#### *3.3. Watershed Segmentation*

The watershed transform is an image processing technique that can be applied to a binary image for object segmentation. In the segmented images of nucleus tissue components, we observed that there were many overlapping cell nuclei. We separated these connected objects by applying the watershed segmentation algorithm [20,21]. This method was used to extract nucleus-based morphological features for SVM classification. We validated this algorithm experimentally and found that it performs better than other cell nuclei separation algorithms. It is one of the well-known methods for separating overlapping objects [22].

#### Algorithm for Watershed Segmentation

In the algorithm, *g*(*x*, *y*) denotes the image pixel value and *M<sub>i</sub>* (*i* = 1, ..., *R*) denotes the regional minima. The iteration steps of the algorithm are as follows:

$$T[n] = \{ (x, y) \mid g(x, y) < n \} \tag{3}$$

$$n = \min + 1 \text{ to } n = \max + 1 \tag{4}$$

$$C_n(M_i) = C(M_i) \cap T[n] \tag{5}$$

where *T*[*n*] is the set of coordinates of the points in *g*(*x*, *y*) that lie below level *n*, *n* is the flooding stage (running from the minimum to the maximum value of *g*), *C*(*M<sub>i</sub>*) is the set of coordinates of points in the catchment basin associated with *M<sub>i</sub>*, and *C<sub>n</sub>*(*M<sub>i</sub>*) is the portion of that basin flooded at stage *n*.

$$C_n(M_i) = 1 \text{ at } (x, y) \text{ if } (x, y) \in C(M_i) \text{ and } (x, y) \in T[n] \tag{6}$$

$$C_n(M_i) = 0 \text{ otherwise} \tag{7}$$

We computed the results of the above two equations and viewed the resulting binary image.

$$C[n] = \bigcup_{i=1}^{R} C_n(M_i) \tag{8}$$

$$C[\max + 1] = \bigcup_{i=1}^{R} C(M_i) \tag{9}$$

where *C*[*n*] is the union of the flooded catchment basin portions at stage *n* and *C*[*max* + 1] is the union of all catchment basins. As per Equations (8) and (9), *C*[*n*] is a subset of *T*[*n*] and *C*[*n* − 1] is a subset of *C*[*n*]. Hence, each connected component of *C*[*n* − 1] is contained in exactly one connected component of *T*[*n*].
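The flooding stages of Equations (3) and (4) can be illustrated on a small grey-level array. The Python sketch below only builds the threshold sets *T*[*n*] and checks that they are nested, which is the property the subset argument above relies on; the array `g` is a made-up example, not data from the study.

```python
def threshold_set(g, n):
    """T[n]: coordinates (x, y) whose value g(x, y) lies below level n (Eq. 3)."""
    return {(x, y) for y, row in enumerate(g) for x, value in enumerate(row) if value < n}


# Hypothetical grey-level image with two regional minima (the two pixels equal to 1).
g = [
    [3, 2, 3, 4, 3],
    [2, 1, 2, 4, 2],
    [3, 2, 3, 4, 1],
]
lowest = min(map(min, g))
highest = max(map(max, g))

# Flooding stages n = min + 1, ..., max + 1 (Eq. 4).
stages = {n: threshold_set(g, n) for n in range(lowest + 1, highest + 2)}

# Each stage contains the previous one, and the last stage covers the whole image.
nested = all(stages[n - 1] <= stages[n] for n in range(lowest + 2, highest + 2))
```

At the first stage only the two minima are flooded; at stage *max* + 1 every pixel belongs to *T*[*n*].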

We used the following steps to separate overlapping nuclei:

1. Converted 24-bit/pixel RGB color image to binary using adaptive thresholding method.
2. Removed the noise from the binary image.
3. Applied the Euclidean distance transform to a binary image to generate a distance map.
4. Used a Gaussian filter to smooth the distance map.
5. Applied inverse distance transform after smoothing the distance map.
6. Identified local minima using markers on the inverse distance transform image.
7. Finally, applied watershed segmentation based on local minima points, iterating until all overlapping objects were segmented.

We used the described watershed segmentation algorithm to separate the overlapping cell nuclei. This has been used previously for nucleus counting and to extract features for classification [23]. Figure 4 shows the necessary steps for watershed segmentation, including segmenting the nuclei image, converting to a binary image, applying the Euclidean distance transform, and labeling the watershed image using color mapping.

*Appl. Sci.* **2019**, *9*, 2969

**Figure 4.** Overview of watershed segmentation: (**a**) original segmented image of nucleus tissue components; (**b**) noise-removed binary image; (**c**) Euclidean distance transform on binary image; and (**d**) result of the watershed algorithm and labelled nuclei using color mapping.

However, at the beginning of the watershed segmentation, there were some errors leading to over-segmentation, which caused some objects to be divided into several parts, as shown in Figure 5a. To show an example of over-segmentation, we used a cropped image that was taken from the region marked with a red box in Figure 4. First, to control over-segmentation, we used an approach called the marker-selection watershed transform to improve the segmentation results [24]. This approach determines markers for each region of interest and transforms the distance map image in such a way that the region markers are the only local minima of the resulting image. Second, after the Euclidean distance transform, we applied a Gaussian filter to smooth the distance map and then applied internal markers to the smoothed inverse results of the distance transform, as shown in Figure 5b. Third, the watershed algorithm was applied to the marker selection image, as shown in Figure 5c. Finally, the resulting image was obtained after removing the noise and watershed lines, and the centroid of each nucleus was labelled, as shown in Figure 5d.

**Figure 5.** Improvement of over-segmentation: (**a**) over-segmented objects; (**b**) markers applied to the inverse results of the distance transform; (**c**) watershed algorithm applied to image (**b**); and (**d**) the resulting image after removing the noise and watershed lines, with the centroid of each nucleus labelled.
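The marker-controlled idea can be sketched on a toy binary image of two overlapping discs. The Python code below is a simplified illustration under stated assumptions, not the procedure used in the study: the distance transform is computed by brute force, Gaussian smoothing is omitted, markers are taken as the connected regions where the distance map exceeds 80% of its peak (an assumed rule), and regions are grown from the markers over the inverted distance map with a priority queue. All function names are hypothetical.

```python
import heapq
import math

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connectivity


def distance_transform(mask):
    """Euclidean distance from each foreground pixel to the nearest background pixel."""
    h, w = len(mask), len(mask[0])
    background = [(x, y) for y in range(h) for x in range(w) if not mask[y][x]]
    dist = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                dist[y][x] = min(math.hypot(x - bx, y - by) for bx, by in background)
    return dist


def label_markers(dist, threshold):
    """Label 4-connected groups of pixels whose distance value is >= threshold."""
    h, w = len(dist), len(dist[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if dist[y][x] >= threshold and labels[y][x] == 0:
                count += 1
                labels[y][x] = count
                stack = [(x, y)]
                while stack:
                    cx, cy = stack.pop()
                    for dx, dy in NEIGHBOURS:
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h and labels[ny][nx] == 0
                                and dist[ny][nx] >= threshold):
                            labels[ny][nx] = count
                            stack.append((nx, ny))
    return labels, count


def flood_from_markers(mask, dist, labels):
    """Grow the markers in place over the inverted distance map, lowest values first."""
    h, w = len(mask), len(mask[0])
    heap = [(-dist[y][x], x, y) for y in range(h) for x in range(w) if labels[y][x]]
    heapq.heapify(heap)
    while heap:
        _, x, y = heapq.heappop(heap)
        for dx, dy in NEIGHBOURS:
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and labels[ny][nx] == 0:
                labels[ny][nx] = labels[y][x]
                heapq.heappush(heap, (-dist[ny][nx], nx, ny))
    return labels


def in_disc(x, y, cx, cy, r=4):
    return (x - cx) ** 2 + (y - cy) ** 2 <= r * r


# Toy "touching nuclei": two discs of radius 4 whose centres are 6 pixels apart.
mask = [[in_disc(x, y, 5, 5) or in_disc(x, y, 11, 5) for x in range(17)] for y in range(11)]
dist = distance_transform(mask)
peak = max(max(row) for row in dist)
markers, n_markers = label_markers(dist, 0.8 * peak)  # 0.8 is an assumed cut-off
labels = flood_from_markers(mask, dist, markers)
```

Because one marker is found per disc, the connected blob is split into two labelled regions instead of many small fragments, which is the effect the marker selection is meant to achieve.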

#### *3.4. Feature Extraction*

Feature extraction is a very important step in the analysis of prostate cancer and prediction of cancer grades from microscopic biopsy images. The shape and morphological features of prostate cancer are described in References [25,26]. Although different features have been considered for prostate cancer grading and classification, morphological and texture feature extraction is the most common. Training and testing were performed based on the selected data, which were extracted from tissue images. In total, 19 features were extracted from the cell nucleus and lumen and, among these, 14 significant features were selected for SVM classification. The morphological features of the cell nucleus and lumen considered in this paper are: area, perimeter, major axis length, minor axis length, circularity, diameter, nucleus to nucleus distance, nucleus to nucleus minimum distance, eccentricity, and compactness. After watershed segmentation was performed on the nucleus images, cellular level features were extracted to detect and grade prostate cancer using the SVM classification method [27,28]. We used both region- and contour-based methods on the segmented nucleus and lumen images to gather data about the morphological features. To compare all of the extracted features and identify the most significant ones, we used Fisher's coefficient and analysis of variance (ANOVA) [29,30]. Table 1 describes the significant features of the cell nucleus and lumen. According to the statistical test, all of these features are highly statistically significant (*p* < 0.001).
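Several of the listed shape features can be derived from basic measurements of a segmented object. The Python sketch below uses common textbook definitions (circularity as 4πA/P², diameter as the diameter of a circle of equal area, eccentricity from the major and minor axis lengths); these are assumptions for illustration and may differ from the exact definitions given in Table 1. Compactness is left out because its definition varies between sources.

```python
import math


def circularity(area, perimeter):
    """4*pi*A / P^2: equals 1 for a perfect circle, smaller for irregular shapes."""
    return 4.0 * math.pi * area / perimeter ** 2


def equivalent_diameter(area):
    """Diameter of the circle that has the same area as the object."""
    return math.sqrt(4.0 * area / math.pi)


def eccentricity(major_axis, minor_axis):
    """Eccentricity of the ellipse with the given axis lengths (0 for a circle)."""
    return math.sqrt(1.0 - (minor_axis / major_axis) ** 2)


def nearest_neighbour_distances(centroids):
    """Minimum nucleus-to-nucleus distance for every nucleus centroid."""
    return [
        min(math.dist(c, other) for j, other in enumerate(centroids) if j != i)
        for i, c in enumerate(centroids)
    ]


# Hypothetical measurements: a circular nucleus of radius 10 and three centroids.
circle_area = math.pi * 10 ** 2
circle_perimeter = 2 * math.pi * 10
distances = nearest_neighbour_distances([(0, 0), (3, 4), (10, 0)])
```

For the circular example the circularity is 1 and the equivalent diameter is 20, which gives a quick sanity check before applying the functions to real segmentation output.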

**Table 1.** Proposed features for support vector machine (SVM) binary classification to classify Gleason grading of prostate cancer.


#### *3.5. Support Vector Machine (SVM) Classification*

In this paper, we used SVM classification of morphological features of the cell nucleus and lumen to predict the Gleason grading of prostate cancer. Classification of the various Gleason grade groups from microscopic biopsy images is a very challenging task [31,32]. The classification accuracy depends on the classifier and its kernel type. An SVM is a supervised learning technique that can be applied to both classification and regression problems [33,34]. An SVM finds, in an iterative manner, the optimal hyperplane that maximizes the margin, where the margin is the distance from the hyperplane to the nearest training data point of any class.

For classification purposes, we experimented with a few classifiers, such as logistic regression (LR), linear discriminant analysis (LDA), and SVMs. We selected SVMs for this analysis because they achieved better accuracy. Supervised learning approaches generally proceed as follows: prepare the data set for training and testing; choose an appropriate algorithm; select features to fit the model; train the model; and use the trained model for prediction. In SVM classification, linear and Gaussian kernels were used to classify samples as benign or malignant and to discriminate between Grade 3 vs. Grade 4, 5 and Grade 4 vs. Grade 5 of the Gleason grade groups [35].

We used 2-fold cross-validation to train the model and compared the performance of the different classification models. Later, we adjusted the K-fold cross-validation manually to improve the accuracy [36,37]. The linear kernel, *K*, maps the original data with the kernel function,

$$K(\mathbf{x}, \mathbf{x}') = \mathbf{x} \cdot \mathbf{x}' + c \tag{10}$$

where *x* and *x*′ are feature vectors and *c* is a constant.

In SVM classification, the Gaussian kernel function used for binary classification is expressed as:

$$K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2), \ \gamma = \frac{1}{2\sigma^2} \tag{11}$$

where *x* and *x*′ are feature vectors, ‖*x* − *x*′‖² is the squared Euclidean distance between the two feature vectors, γ is a hyper-parameter that controls the smoothness of the kernel function, and σ is a free parameter.
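Both kernels are simple to evaluate directly. The following Python sketch computes the linear kernel of Equation (10) and the Gaussian kernel of Equation (11) for two feature vectors; the function names and the example vectors are illustrative assumptions, not part of the original implementation.

```python
import math


def linear_kernel(x, x_prime, c=0.0):
    """Linear kernel: dot product of the two feature vectors plus a constant c."""
    return sum(a * b for a, b in zip(x, x_prime)) + c


def gaussian_kernel(x, x_prime, sigma=1.0):
    """Gaussian kernel with gamma = 1 / (2 * sigma^2)."""
    gamma = 1.0 / (2.0 * sigma ** 2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)


# Hypothetical two-dimensional feature vectors.
k_linear = linear_kernel((1.0, 2.0), (3.0, 4.0), c=1.0)
k_same = gaussian_kernel((1.0, 2.0), (1.0, 2.0))
k_far = gaussian_kernel((0.0, 0.0), (1.0, 1.0), sigma=1.0)
```

The Gaussian kernel equals 1 for identical vectors and decays towards 0 as they move apart; a smaller σ (larger γ) makes the decay faster.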

To classify Gleason grade groups, we used the proposed binary classification approach, which divides the multi-category classification into multiple two-category groupings. Each division in Figure 6 represents a separate and independent classification, amounting to three binary divisions. In the first step, all of the samples in the dataset were classified as "malignant" vs. "benign". Within the cancer group, we separated the dataset into Grade 3 vs. Grade 4+5 and Grade 4 vs. Grade 5, and classified these using different SVM models [38–40].

**Figure 6.** Proposed binary method for support vector machine (SVM) classification. Three different classifiers have been used here for binary classification and each group is classified independently and separately.
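The three binary divisions can be chained into a single four-class decision. The Python sketch below shows only the control flow of such a cascade; the three classifier arguments stand in for trained SVM models, and the threshold rules on a made-up `score` feature are hypothetical placeholders used to make the example runnable.

```python
def cascade_predict(features, is_malignant, is_grade_4_or_5, is_grade_5):
    """Apply the three binary decisions in sequence and return a class name."""
    if not is_malignant(features):
        return "Benign"
    if not is_grade_4_or_5(features):
        return "Grade 3"
    return "Grade 5" if is_grade_5(features) else "Grade 4"


# Hypothetical stand-ins for the three trained binary classifiers.
def toy_malignant(features):
    return features["score"] >= 1


def toy_grade_4_or_5(features):
    return features["score"] >= 2


def toy_grade_5(features):
    return features["score"] >= 3


predictions = [
    cascade_predict({"score": s}, toy_malignant, toy_grade_4_or_5, toy_grade_5)
    for s in range(4)
]
```

Each classifier only sees the samples that reach its stage, so an error in the first division cannot be corrected by the later ones; this is a known trade-off of cascaded schemes.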

#### **4. Results and Discussion**

Quantitative analysis was performed on each cancerous image based on the four prostate cancer tissue groups (Grade 3, Grade 4, Grade 5, and Benign). We implemented the proposed method using MATLAB R2018a. We performed data analysis to analyze the components of the nuclei, which were segmented from prostate tissue images.

In this paper, 400 images were used in total. Of these, 240 were used for training and 160 were used for testing. The number of images considered for each group was 100, and these were classified as malignant vs. benign, Grade 3 vs. Grade 4+5, and Grade 4 vs. Grade 5. Each image was 24-bits/pixel with a size of 512 × 512 pixels. All of the possible results are shown in Tables 2–4, where we show the confusion matrices of SVM binary classification for training and testing separately.



**Table 3.** Confusion matrix of SVM binary classification: Grade 3 vs. Grade 4, 5. Training: 91.7%; Testing: 85.0%.

| Train | Grade 3 | Grade 4+5 | Data | Test | Grade 3 | Grade 4+5 | Data |
|---|---|---|---|---|---|---|---|
| Grade 3 | 55 | 5 | 60 | Grade 3 | 36 | 4 | 40 |
| Grade 4+5 | 5 | 55 | 60 | Grade 4+5 | 8 | 32 | 40 |

**Table 4.** Confusion matrix of SVM binary classification: Grade 4 vs. Grade 5. Training: 95.0%; Testing: 92.5%.

| Train | Grade 4 | Grade 5 | Data | Test | Grade 4 | Grade 5 | Data |
|---|---|---|---|---|---|---|---|
| Grade 4 | 54 | 6 | 60 | Grade 4 | 36 | 4 | 40 |
| Grade 5 | 0 | 60 | 60 | Grade 5 | 2 | 38 | 40 |


Tables 2–4 show the confusion matrices used to evaluate the performance of the classifiers on the training and test data. These tables give a clearer picture of the errors made by each classification model. Each table is divided into two parts that show the correctly classified and misclassified samples for the training and testing processes, respectively.

In Table 5, we used four types of performance metrics, namely, accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC). These metrics were calculated from the entries of our confusion matrices, i.e., true positive (TP), true negative (TN), false positive (FP), and false negative (FN). We multiplied the accuracy by 100 to express it as a percentage, on the same scale as the other measurements. The four performance metrics used in Table 5 are explained as follows:

1. Accuracy is a measure of the proportion of correctly classified samples.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100\tag{12}$$

2. Sensitivity is a measure of the proportion of positive samples that are correctly classified.

$$Sensitivity = \frac{TP}{TP + FN} \times 100\tag{13}$$

3. Specificity is a measure of the proportion of negative samples that are correctly classified.

$$Specificity = \frac{TN}{TN + FP} \times 100\tag{14}$$

4. The Matthews correlation coefficient (MCC) is a measure of the quality of a binary classification. It is a correlation coefficient between targets and predictions.

$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{((\text{TP} + \text{FN})(\text{TP} + \text{FP})(\text{TN} + \text{FN})(\text{TN} + \text{FP}))}} \times 100\tag{15}$$

Table 5 shows the classification results of the proposed method for the three binary divisions. The SVM binary classification accuracy, sensitivity, specificity, and MCC for malignant vs. benign are 88.7%, 91.8%, 86.0%, and 70.2%, respectively. For Grade 3 vs. Grade 4+5, the classification accuracy, sensitivity, specificity, and MCC are 85.0%, 81.8%, 88.8%, and 70.3%, respectively. For Grade 4 vs. Grade 5, the classification accuracy, sensitivity, specificity, and MCC are 92.5%, 94.7%, 95.0%, and 85.1%, respectively.


**Table 5.** Evaluation results and performance metrics for three binary divisions using SVM.

For validation, we also performed prostate cancer grading classification using the multilayer perceptron (MLP) technique in Weka; the results are shown in Table 6. An MLP is a class of feed-forward artificial neural network that consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Except for the input nodes, each node is a neuron that uses a non-linear activation function. Like SVM, MLP uses a supervised learning technique. From the results shown in Tables 5 and 6, we can see that the proposed SVM binary classification works significantly better than MLP, and the highest accuracy obtained was 92.5%, for Grade 4 vs. Grade 5. The first classification was performed to detect cancer in all of the samples in the dataset. The second and third classifications were performed within the cancer group for low- and high-grade cancer detection. In Figure 7, the bar graph compares the results for the three binary divisions used for SVM classification.
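To make the layer structure described above concrete, the following minimal sketch propagates a feature vector through one hidden layer and one output layer. It is not the Weka implementation used in the study: the logistic activation, the network size, and all weights are assumptions chosen only for illustration.

```python
import math


def sigmoid(v):
    """Logistic activation, one common non-linear choice (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-v))


def mlp_forward(inputs, layers):
    """Feed an input vector through successive (weights, biases) layers.

    Input nodes only pass their values on; every hidden and output node
    applies the non-linear activation to its weighted sum.
    """
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


# Hypothetical 2-3-1 network: two input features, three hidden neurons, one output.
hidden = ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1])
output = ([[1.0, -1.0, 0.5]], [0.0])
score = mlp_forward([0.6, 0.9], [hidden, output])[0]
print(score)  # a value strictly between 0 and 1
```

In supervised training, the weights and biases would be adjusted from labeled samples; the sketch only covers the forward pass.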

**Table 6.** Evaluation results for three binary divisions using the multilayer perceptron (MLP) classification technique.



To predict prostate cancer grades automatically, we used machine learning and deep learning algorithms, namely SVM and MLP, respectively. First, we applied image segmentation as a preprocessing step. Second, we converted the images from RGB to binary to carry out watershed segmentation. Third, we calculated a set of morphological features based on the segmented nucleus and lumen tissue images. Finally, the SVM and MLP classifications were performed based on the significant features selected.

The comparison of SVM classification accuracy in Table 7 and Figure 8 shows that the results differ between the one-shot and binary classifiers. When we classified our data using the multi-class (one-shot) classifier, the classification accuracies for benign, Grade 3, Grade 4, and Grade 5 were 60%, 55%, 85%, and 50%, respectively. Using the proposed binary classification approach, the accuracies for the same groups were 92.5%, 90.0%, 90.0%, and 95.0%, respectively. The binary classifier therefore gives better results than the one-shot classifier. Table 8 shows the same comparison for the MLP classifier. Comparing the SVM and MLP classification methods, we can say that the proposed method, SVM, achieved better results than MLP. In one-shot classification, the entire dataset is classified into four groups simultaneously. In this case, the errors in one class affect the performance of the others, which lowers the classification accuracy and makes the model's predictions less reliable. In binary classification, by contrast, the entire dataset is separated into three groups and each group is classified separately and independently, so the errors in one class do not affect the performance of the other class.
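One way to read the three binary divisions is as a cascade that assigns one of the four labels. The sketch below shows that reading only. The text does not state that the classifiers were chained in this way, and the three decision functions and the feature names (`nuclei_area`, `lumen_ratio`) are hypothetical stand-ins for the trained SVMs and their morphological features.

```python
def grade_cascade(features, is_malignant, is_high_grade, is_grade5):
    """Assign a label by applying three binary decisions in sequence."""
    if not is_malignant(features):      # malignant vs. benign
        return "Benign"
    if not is_high_grade(features):     # Grade 3 vs. Grade 4+5
        return "Grade 3"
    return "Grade 5" if is_grade5(features) else "Grade 4"  # Grade 4 vs. Grade 5


# Hypothetical stand-ins for the three trained binary classifiers.
def is_malignant(f):
    return f["nuclei_area"] > 1.0


def is_high_grade(f):
    return f["nuclei_area"] > 2.0


def is_grade5(f):
    return f["lumen_ratio"] < 0.1


sample = {"nuclei_area": 2.5, "lumen_ratio": 0.05}
print(grade_cascade(sample, is_malignant, is_high_grade, is_grade5))
```

Each decision function sees only the samples that reach its stage, which is the sense in which the three binary problems are handled separately.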

**Table 7.** Support vector machine (SVM) classifier, comparison between one-shot and binary classification.


**Figure 8.** Comparison between support vector machine (SVM) classifiers among the four Gleason grade groups. In the case of one-shot classification, the classifier could not accurately distinguish among the four groups. In the case of binary classification, the classifier was almost always accurate, with little variation.

**Table 8.** Multilayer perceptron (MLP) classifier, comparison between one-shot and binary classification.

| Group | One-shot classification accuracy (%) | Binary classification accuracy (%) |
|---|---|---|
| Benign | 37.5 | 87.5 |
| Grade 3 | 67.5 | 90.0 |
| Grade 4 | 45.0 | 75.0 |
| Grade 5 | 70.0 | 77.5 |
| Total | 55.5 | 82.5 |

In Table 9, we compare the accuracy of different standard classification methods with our proposed method. The classification accuracy achieved for low vs. high grade using the proposed method is higher than that of the other methods described in the literature. For cancer diagnosis (malignant vs. benign), our result is better than those of Nir et al. (2018) and Doyle et al. (2006), but not higher than that of Tabesh et al. (2017), who used different types of features extracted from the tissue image, namely color channel histogram, fractal dimension, fractal code, wavelet, and MAGIC. The authors of Reference [4] computed the features of epithelial nuclei objects in the tissue image, whereas our method computed the features of all nuclei objects existing in the biopsy prostate tissue image.



**Table 9.** Comparison between the proposed method and other standard methods for the classification of prostate cancer gradings.


## **5. Conclusions**

In this study, we have developed a computerized grading system for digitized histopathology images using supervised learning methods. The biopsy tissue images were segmented using the k-means algorithm, and touching cells were separated using the watershed algorithm. Morphological features were selected for prostate cancer grading and diagnosis. Gaussian and linear kernels were used for the classification of prostate histopathological images. Using these kernels, we observed some improvements in the results and gradually increased the performance of the model used for training and testing. The parameters of the kernel play a vital role in the classification process, and the best combination of *C* and γ was selected for better classification accuracy. Satisfactory classification results were obtained using the morphological features, which were extracted from sub-images viewed at 40× magnification. The quantitative analysis described here is flexible in terms of implementation. The SVM binary classification method presented in this paper was used to classify malignant vs. benign, Grade 3 vs. Grade 4+5, and Grade 4 vs. Grade 5. Our results are satisfactory and comparable with those reported in the literature, and they provide quantitative measures based on the features extracted from microscopic biopsy tissue images. To validate the proposed SVM method, we also carried out feature classification using MLP. One-shot and binary classification results were compared to show the difference in accuracy between the two approaches. In future studies, we will aim to improve our classification accuracy using combinations of multiple features. Deep learning and machine learning techniques will be used for comparative analysis, where image classification will be performed using a convolutional neural network (CNN) and feature classification will be performed using a support vector machine (SVM).

**Author Contributions:** Conceptualization, S.B., H.-G.P. and N.M.; Formal analysis, S.B., D.P., J.-H.S. and N.-H.C.; Methodology, S.B.; Project administration, C.-H.K.; Resources, H.-G.P., C.-H.K. and N.-H.C.; Supervision, H.-K.C.; Validation, S.B.; Visualization, N.M., J.-H.S., N.-H.C. and H.-K.C.; Writing—original draft, S.B.; Writing—review & editing, N.M.

**Funding:** This research was funded by the Ministry of Trade, Industry, and Energy (MOTIE), Korea, grant number (R&D, P0002072).

**Acknowledgments:** This research was financially supported by the Ministry of Trade, Industry, and Energy (MOTIE), Korea, under the "Regional Specialized Industry Development Program (R&D, P0002072)" supervised by the Korea Institute for Advancement of Technology (KIAT).

**Ethical Approval:** All subjects provided written informed consent for their participation in the study, which was approved by the Institutional Ethics Committee at College of Medicine, Yonsei University, Korea (IRB no. 1-2018-0044).

**Conflicts of Interest:** The authors declare that they have no conflicts of interest.

### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Applications of Capacitive Imaging in Human Skin Texture and Hair Analysis**

#### **Christos Bontozoglou 1,\* ,† and Perry Xiao 1,2,\* ,†**


Received: 1 August 2019; Accepted: 24 December 2019; Published: 29 December 2019

**Abstract:** This article focuses on the extraction of information from human skin and scalp hair for the evaluation of a subject's condition in the cosmetic and pharmaceutical industries. It uses capacitive images from existing hand-held research equipment and applies image processing algorithms to expand their possible applications. The literature review introduces readers to the field of skin research and highlights the information that can be extracted by in vivo skin and ex vivo hair measurements. Then, the selected scientific equipment is presented, and Maxwell-based electrostatic simulations are employed to evaluate the measurement apparatus. Image analysis algorithms are suggested for (a) the detection of polygons in the human skin texture, (b) the estimation of wrinkle length and (c) the observation of hair water sorption capabilities by capacitive imaging systems. Finally, experiments are conducted to evaluate the performance of the presented algorithms and the results are compared with the literature. The results indicate that capacitive imaging systems can be used for skin age classification, detection and tracking of skin artifacts (e.g., wrinkles, moles or scars) and calculation of water content in hair samples.

**Keywords:** texture; skin microrelief; water sorption; aging; hair

### **1. Introduction**

The electrical properties of skin and hair, alongside their texture and anatomy, provide information about a person's health, the efficiency of drug delivery and the effects of cosmetic products. For these reasons, scientists from a variety of fields use non-invasive instruments to extract such information and obtain experimental results that strengthen their research. This introduction first summarizes human skin structure and hair anatomy and then illustrates the above points with selected publications. Finally, the use of capacitive imaging in various research fields is demonstrated by summarizing selected research in the literature.

The skin is the largest human organ in terms of surface area and its thickness varies from 5 µm to 1 mm (or more) in different areas of the body [1]. It protects internal organs from environmental influences, and it regulates body water loss. As illustrated in Figure 1 (left), the skin is separated into three main layers: epidermis, dermis and subcutis. Among other functions, the epidermis provides chemical and diffusion protection, the dermis protects from external mechanical forces and the subcutis connects the skin to the underlying tissue [1]. The outermost sublayer, the target of non-invasive instruments, is called the stratum corneum and consists mainly of dead keratinocytes. The textural information of this layer, or skin microrelief, is affected by internal body health, age and living habits as well as by environmental influences [2]. The visible part of hair (Figure 1, right) is shaped as a three-layered shaft of dead protein filament: the cuticle, cortex and medulla [3]. The cuticle is a thin surrounding layer of roof-tile-like structures with about five degrees inclination from the hair shaft core. The cortex is the thickest layer of the hair; it consists of tightly packed keratin cells and it is responsible for the hair color as well as most of the water-holding capability. Last, the medulla is a soft unstructured keratin that forms the core in thicker human hair (e.g., scalp hair) [4].

**Figure 1.** Human skin [5] and hair anatomy [6].

In the literature, a variety of methods have been used to extract information from the human skin surface. Zhang et al. [7] used a capacitive imaging system to measure solvent concentration and penetration in human skin to demonstrate how such a system can support skin clinical trial studies. In the same work, skin damage by intense washing, tape stripping and SLS irritation was characterized. In the area of skin aging analysis, Corcuff et al. [8] studied the skin furrow response during arm extension based on image analysis of negative skin replicas. They provided evidence that younger people can buffer skin strain between primary and secondary line orientations, while elderly subjects tend to have furrows in only one orientation, which rotate during extension. More recently, Zahouani et al. [9] achieved better classification between primary and secondary skin lines by measuring their depth using three-dimensional confocal microscopy. Experimental results on 120 Caucasian women confirmed that secondary lines fade with age while the depth of primary lines increases. A different approach to evaluating skin aging was employed in [10,11], where the surface area of the polygons shaped between skin furrows was measured and found to associate well with the subjects' chronological age.

As in skin research, hair samples are also analyzed by scientists to detect health and cosmetic conditions. Wosu et al. [12] reviewed 39 studies that associate hair cortisol concentrations to stress psychiatric symptoms and disorders. This approach was found to be more accurate in detecting long-term stress because cortisol concentration measurements from blood, urine and saliva samples are influenced by living habits and environmental conditions. Furthermore, Kristensen et al. [13] used hair samples from 266 women to determine that hair dyeing and frequent washing do not affect cortisol measurements. In a different scientific field, Boll et al. [14] employed ATR FT-IR spectroscopy to differentiate between dyed and undyed hair. Such information can be used in forensic hair analysis, given that their static classification was found to be 98.1 ± 3.0% accurate not only in detecting whether a sample is dyed but also in identifying the brand and the color of the dye. In the field of cosmetic science, Barba et al. [15] used a thermogravimetric methodology to measure the water content of hair and to assess hair damage from bleaching and straightening. They found that bleached hair showed a 3.9% reduction in water-holding capability, while straightened hair showed a 9.5% reduction.

In this work, a capacitive imaging system is used to extract information from skin and hair samples. To the best of our knowledge, Lévêque and Querleux [16] first used this technology for human skin characterization and surface hydration mapping in 2003. They used a fingerprint system to measure the distribution of skin surface capacitance in different body sites, to detect the main microrelief orientation and density, and to support the system's usability in the field of skin research by performing side-by-side experiments with the Corneometer CM812. Later on, Batisse et al. [17] demonstrated advantages of this technology over other skin hydration apparatuses by pointing out the importance of visually observing the contact of a capacitive sensor with the skin during lip moisturization and volar forearm inflammation experiments. Since then, capacitive imaging systems have been used to examine various skin conditions, e.g., mapping of psoriasis and acne lesions, and assessment of sun exposure, skin aging, damage or irritation [18–22]. Furthermore, it is worth mentioning successful attempts to improve the capacitive imaging technology. Bevilacqua and Gherardi [23] achieved depth profiling up to 50 µm by fusing the image with pressure information monitored by a subsystem attached to the back of the measurement apparatus. Also, Huang et al. developed a wearable capacitive imaging system using "ultrathin, stretchable sheets with arrays of embedded impedance sensors for precise measurement and spatially multiplexed mapping" [24].

In the following sections, the measurement principle of capacitive imaging for non-invasive skin and hair measurements is analyzed. Then, image processing algorithms are suggested to extract quantitative values from skin and hair samples. Finally, three experiments are conducted to evaluate both the imaging system and the integrity of the selected algorithms.

#### **2. Materials and Methods**

#### *2.1. Measurement Apparatus*

To capture hydration information in the pharmaceutical and cosmetic industries, non-invasive capacitance sensing systems are employed. In the field of skin research, the most widely known single-pixel capacitive system is the Corneometer CM825 (Courage+Khazaka, Cologne, Germany) [25]. This instrument has a 7 × 7 mm sensor in interdigital electrode geometry with 250 µm spatial wavelength and 50 µm electrode width (Figure 2). Its sensor is covered with a 20 µm thick layer of glass for galvanic contact protection [26]. Capacitive imaging sensors are based on the same technology as the Corneometer and are widely known as fingerprint sensors. Their sensing surface consists of a two-dimensional array of miniaturized capacitive pixels that capture the hydration map of the sample. In this work, the Epsilon E100 (Biox Systems Ltd., London, UK) [27] is used. It has a sensing surface of 12.8 × 15 mm filled with 300 × 256 pixels and a 2 µm coat of silicon dioxide for galvanic contact protection. The accompanying software exports hydration information in the form of color-coded images. In those images, brighter pixels denote higher hydration levels and darker pixels denote lower hydration levels.

**Figure 2. Left**, 3D representation of scalar potential for Corneometer's electrodes as produced using Agros2D [28]. **Right**, representation of the fingerprint sensor pixels embedded in Epsilon E100 [27].

#### *2.2. Measurement Depth Simulation*

In capacitive measurements, the measurement depth is as important as the image resolution and sensitivity. For in vivo skin measurements, the penetration depth of the electric field should not exceed the thickness of the stratum corneum, which consists of dead and keratinized cells [1]. Otherwise, the pixel readouts will saturate because of the higher conductance in deeper skin layers. By contrast, in hair measurements the depth should be sufficient to reach the medulla in order to capture information from all layers in the hair shaft. In this work, the measurement depth is defined as the distance from the sensor surface at which the electric potential has dropped by 97%. To the best of our knowledge, this is in agreement with experimental and theoretical results achieved by Huanyu Cheng et al. [29], where the system capacity is used as a reference instead of the electric potential. Of course, in either approach, the system response changes with the hydration level of the sample, so the penetration depth of the electric field varies accordingly. In order to evaluate the performance of the Epsilon E100, the penetration depth for different insulating materials is simulated using Maxwell's equations for both the E100 and the CM825, and the results are compared.

A property of electromagnetics is that Maxwell's equations are linear. As a result, we can refer to an arbitrary distribution of charges by simple addition of the relevant contributions. In the case of interdigital sensors, such as Corneometer, for the charge distribution residing on a strip (*x*1, *x*2), the electric potential at an arbitrary point (*x*, *y*) is given by the expression in Equation (1). The simulation defined 6 driving and 5 sensing rectangular electrodes in the dimensions provided by CM825 literature. The electric potential was calculated for points on a line segment from the surface of the middle driving electrode to 60 µm perpendicular displacement.

$$\Phi(x, y) = \lambda k_{\varepsilon} \left(\sinh^{-1} \frac{x_2 - x}{|y|} - \sinh^{-1} \frac{x_1 - x}{|y|}\right) \tag{1}$$

where:

Φ = the electric potential

*λ* = the surface charge density and

*k<sub>ε</sub>* = the dielectric of the insulating material under examination (*ke*0)
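As a rough illustration of Equation (1) and of the superposition argument, the sketch below evaluates the potential of one strip and sums the contributions of several strips. It implements the expression exactly as written above, in arbitrary units; the function names, the unit charge density and the example geometry are assumptions, and this is not the simulation used for Figure 3.

```python
import math


def strip_potential(x, y, x1, x2, lam=1.0, k_eps=1.0):
    """Equation (1): potential at (x, y) from a charged strip spanning x1..x2."""
    if y == 0:
        raise ValueError("Equation (1) is undefined in the strip plane (y = 0)")
    return lam * k_eps * (
        math.asinh((x2 - x) / abs(y)) - math.asinh((x1 - x) / abs(y))
    )


def total_potential(x, y, strips, k_eps=1.0):
    """Superpose the contributions of strips given as (x1, x2, lam) tuples."""
    return sum(strip_potential(x, y, x1, x2, lam, k_eps) for x1, x2, lam in strips)


# One 50 µm wide strip centered at x = 0, evaluated above its center (distances in µm).
for y in (5.0, 20.0, 60.0):
    print(y, strip_potential(0.0, y, -25.0, 25.0))
```

Because the contributions add linearly, two strips with opposite charge placed symmetrically about a point cancel at that point, which gives a quick sanity check for any electrode layout passed to `total_potential`.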

In the case of annular or disk sensors, such as Epsilon E100, Equation (2) calculates the electric potential in any given point perpendicular to the center of the electrode. For the purpose of this simulation, one driving electrode disk and one sensing electrode ring were used. The electric potential was measured from the center of the sensor surface to 60 µm distance on a perpendicular line segment.

$$\Phi(z) = \frac{\lambda}{2k_{\varepsilon}} \left(\sqrt{R_{out}^2 + z^2} - \sqrt{R_{in}^2 + z^2}\right) \tag{2}$$

where:

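*R<sub>out</sub>* = the outer electrode radius

*R<sub>in</sub>* = the inner electrode radius

*z* = the perpendicular distance from the center of the electrode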
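The measurement-depth definition used in this work (a 97% drop of the electric potential) can be sketched on top of Equation (2). The example below is a simplified illustration for a single homogeneous medium, not the layered coating-plus-sample simulation behind Figure 3; the function names, the bisection search and the unit radius are assumptions.

```python
import math


def ring_potential(z, r_in, r_out, lam=1.0, k_eps=1.0):
    """Equation (2): on-axis potential at distance z from an annular electrode."""
    return lam / (2 * k_eps) * (
        math.sqrt(r_out ** 2 + z ** 2) - math.sqrt(r_in ** 2 + z ** 2)
    )


def measurement_depth(potential, drop=0.97, z_max=1e4, tol=1e-9):
    """Distance at which potential(z) has dropped by `drop` relative to z = 0.

    Uses bisection, which assumes the potential decreases monotonically with z.
    """
    target = (1.0 - drop) * potential(0.0)
    lo, hi = 0.0, z_max
    if potential(hi) > target:
        raise ValueError("potential does not reach the threshold within z_max")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if potential(mid) > target:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical disk electrode (r_in = 0) of unit radius, arbitrary units.
depth = measurement_depth(lambda z: ring_potential(z, r_in=0.0, r_out=1.0))
print(depth)
```

In this single-medium form the factor λ/(2k<sub>ε</sub>) cancels when the ratio to the surface potential is taken, so the computed depth depends only on the electrode geometry; the dependence on hydration reported in the text comes from the layered structure, which this sketch does not model.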


The hydration level of the samples is simulated by changing the dielectric permittivity parameter (*k<sub>ε</sub>*). The latter ranges from 1 to 80.1 at 20 °C, where higher values denote more hydrated samples. For both instruments, the dielectric permittivity of the protective layers is assumed to be 3.9 and the simulation is repeated for samples with dielectric permittivity of 80, 7 and 3.9. The simulation results in Figure 3 show that the Corneometer CM825 measurement depth ranges from 10 to 40 µm, while that of the Epsilon E100 ranges from 5 to 22 µm. Assuming that the stratum corneum thickness is between 10 µm and 40 µm, the results indicate that both instruments are appropriate for human skin measurements, with the Corneometer having the greater penetration depth. For hair measurements, assuming a sample thickness of 70–150 µm, both instruments have insufficient penetration depth to reach the medulla but still reach a fraction of the cortex layer.

**Figure 3.** Maxwell-based simulation results to compare measurement depth between Corneometer CM825 and Epsilon E100. The end of the protective layers is overlapping with the y-axis (x = 0). Therefore, electric potential readouts at the left of the y-axis (x < 0) are inside the protective layer, while readouts at the right of the y-axis (x > 0) are taken from within the sample under examination. Also, the electric potential on the electrode surface is configured to 0.3 V, but the upper part is cropped for better visualization of the results. The horizontal line at 0.09 V represents the 3% depth threshold.

#### *2.3. Analysis of Capacitive Image*

In this section, image processing algorithms are applied on skin and hair capacitive images to demonstrate how texture and color information can be used in cosmetic and pharmaceutical industries. More specifically, algorithms are presented: (a) to detect the average surface area of skin polygons and associate the results with the chronological age of the subject, (b) to estimate the length of skin furrows and track progression over the course of time and (c) to calculate the water content and loss rate in human hair.

#### 2.3.1. Skin Polygons Detection

The human skin texture is highly inhomogeneous, and it worsens with age because the elastic fiber network weakens and strain is not buffered efficiently [8,9]. Therefore, in order to detect the skin polygons across a wide range of age groups, a segmentation algorithm is required that focuses only on pixels in contact with the skin, suppresses wrinkles until they become sizeless and does not need predefined seeds. According to [30], segmentation algorithms may be separated into two categories: graph-based and gradient-based segmentation. In graph-based segmentation, weights are calculated for each pair of neighboring pixels and superpixels are identified by minimizing a cost function. As illustrated in Figure 4, graph-based segmentation algorithms are not fit for this application because they do not suppress skin furrows.

**Figure 4.** Example application of a graph-based segmentation algorithm on a capacitive image from the volar wrist area. Left, the original frame and right, the segmented frame. For this demonstration, the algorithm developed by Felzenszwalb and Huttenlocher is used [31].

To the best of our knowledge, Bevilacqua and Gherardi [18] were the first to apply the gradient-based segmentation algorithm by Vincent and Soille [32] to skin capacitive images. In order to cope with pixel noise and skin hydration inhomogeneity, the frames are pre-processed with a normalization filter. This filter controls the over-/under-segmentation of the sample, but given the vast variety of skin texture between subjects and sites, a global configuration for an absolute polygon count is not feasible. Thus, the reliability of this approach is assessed experimentally by calculating the correlation between polygon density and subject age and comparing the results with those reported in the literature using manual polygon counts [33]. An example output of this segmentation algorithm is shown in Figure 5.

**Figure 5.** Example of gradient-based segmentation algorithm applied on capacitive image from the palm thenar. Right, the original and left the segmented frame. The lines that overlap with skin furrows represent the sizeless boundaries between segments. Reproduced with permission from [33], John Wiley and Sons, 2018.

#### 2.3.2. Feature Length Estimation

Lévêque and Querleux [16] applied the Gray Level Co-Occurrence Matrix (GLCM) to capacitive images to calculate the angle between primary and secondary skin lines in the volar forearm area and to estimate the age of the subject. While this measurement apparatus has shown poor performance in calculating the skin anisotropy index compared to 3D confocal microscopy, due to its lack of depth profiling, this does not restrict its application to 2D feature extraction [33]. The GLCM is calculated by Equation (3) [34].

$$P(i,j,d,\theta) = \sum_{x=0}^{n} \sum_{y=0}^{m} \delta_{iI_1} \delta_{jI_2} \tag{3}$$

where:

*i*, *j* = the greyscale levels

*δ<sub>kI</sub>* = the Kronecker delta

*I*<sub>1</sub> = I(x, y)

*I*<sub>2</sub> = I(x + d cos *θ*, y + d sin *θ*)

In this study, the greyscale levels of the image are reduced from 256 to three (i.e., below, target and over) to strengthen the correlation between pixels within the target level, with the drawback of reducing classification capabilities [35]. The length of the target feature *d<sub>length</sub>* towards any discrete orientation *θ<sub>target</sub>* is estimated by the displacement value *d* for which ∇*P* approaches zero (*d<sub>length</sub>* = *d* as ∇*P* → 0). The feature length is converted from pixels to SI units using the pixel pitch from the instrument's specification (i.e., 50 µm). Of course, it is also necessary to select a region of interest to eliminate effects from neighboring areas and to use repositioning algorithms between frames to track the same wrinkle over time [7].
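The length estimate described above can be sketched as follows, under one possible reading of the criterion: the target-to-target co-occurrence count is followed over increasing displacement, and the first displacement at which it stops changing is taken as the feature length. The sketch is restricted to the horizontal orientation (θ = 0); the function names, the target range and the example row are hypothetical.

```python
def target_cooccurrence(image, d, target_range):
    """P(target, target, d, theta=0): in-range pixel pairs d columns apart."""
    lo, hi = target_range
    count = 0
    for row in image:
        for x in range(len(row) - d):
            if lo <= row[x] <= hi and lo <= row[x + d] <= hi:
                count += 1
    return count


def feature_length(image, target_range, max_d, tol=0):
    """First displacement d at which the change in P falls to `tol` or below."""
    prev = target_cooccurrence(image, 1, target_range)
    for d in range(1, max_d):
        nxt = target_cooccurrence(image, d + 1, target_range)
        if abs(nxt - prev) <= tol:   # discrete gradient of P approaches zero
            return d
        prev = nxt
    return None                      # no plateau found within max_d


PIXEL_PITCH_UM = 50  # from the instrument specification quoted in the text

# Hypothetical row containing a five-pixel feature inside the target range.
frame = [[10] * 3 + [200] * 5 + [10] * 4]
length_px = feature_length(frame, target_range=(150, 255), max_d=10)
print(length_px, "pixels =", length_px * PIXEL_PITCH_UM, "µm")
```

On real frames a non-zero `tol` and a region of interest would be needed, since noise and neighboring features keep the count from settling exactly.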

#### 2.3.3. Hair Water Content

In previous work [7], the skin water content percentage and solvent concentration are calculated from capacitive images using Equation (4). In that study, a normalized cross-correlation algorithm detects the same skin area across measurements taken before and after the solvent application on the skin, which allows the solvent concentration in skin to be measured more accurately. In order to apply the same equation to hair samples, the repositioning algorithm is replaced with a range threshold [36]. In this way, only the pixels in contact with hair samples are isolated, which increases the accuracy of the water percentage calculation. Figure 6 demonstrates how this range threshold excludes pixels with bad contact on a single frame of a hair sample.

$$\text{Water Content}[\%] = 100\, \frac{\varepsilon_m - \varepsilon_{dry}}{\varepsilon_{water} - \varepsilon_{dry}} \tag{4}$$

where $\varepsilon_m$, $\varepsilon_{dry}$ and $\varepsilon_{water}$ are the dielectric permittivities of the sample, the dry sample and deionized water, respectively.
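A minimal sketch of Equation (4) combined with a range threshold is given below. It assumes that each pixel of the frame holds a dielectric permittivity readout and that the threshold limits are passed in by the caller; the function names and the numerical values in the usage note are illustrative assumptions, not the implementation of the instrument software.

```python
import numpy as np


def water_content_percent(eps_m, eps_dry, eps_water):
    """Equation (4): water content in percent from dielectric permittivity."""
    return 100.0 * (eps_m - eps_dry) / (eps_water - eps_dry)


def mean_water_content(frame, eps_dry, eps_water, low, high):
    """Average water content over the pixels that pass the range threshold.

    Pixels outside [low, high] are treated as air or bad contact and are
    excluded. Returns NaN when no pixel passes the threshold.
    """
    frame = np.asarray(frame, dtype=float)
    mask = (frame >= low) & (frame <= high)
    if not mask.any():
        return float("nan")
    return float(water_content_percent(frame[mask], eps_dry, eps_water).mean())
```

As a usage illustration with made-up reference values, `mean_water_content(frame, eps_dry=2.0, eps_water=80.0, low=3.5, high=80.0)` averages only the pixels whose readout lies inside the chosen range.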

**Figure 6.** Example of threshold application on a hair capacitive image. On the left, the original frame, with acclimatized hair in bright green and air/bad contact in darker green shades. On the right, the software output after a range threshold is applied to exclude pixels with bad contact before calculating water content.

#### **3. Results**

In this section, three experiments are conducted to examine the performance of the presented algorithms on capacitive images. The first experiment evaluates the Vincent and Soille segmentation algorithm for automatically counting the skin polygons, i.e., the skin areas shaped between wrinkles. For this purpose, capacitive images were recorded from 12 volunteers aged from 12 to 74 years. The samples were taken from the middle volar forearm area while the arm was in a resting position to reduce strain. Then, the segmentation algorithm was applied using Epsilon E100 software and the average number of polygons per square millimeter was correlated against the subjects' age. The results in Table 1 demonstrate that the average number of polygons per surface area decreases with age. The calculated correlation (−0.71) is in agreement with previous studies in the literature [10,11].

**Table 1.** Experimental results for skin polygons per surface area in the middle volar forearm across 12 volunteers in different age groups using capacitive images. The correlation of the average number of polygons per mm<sup>2</sup> against subjects' age is −0.71. Reproduced with permission from [33], John Wiley and Sons, 2018.


The second experiment consists of a short comparison between C-Cube, a calibrated digital spectroscope (Pixience, Toulouse, France) [37], and Epsilon E100 in feature length estimation. The same skin area of the volar forearm was captured with both instruments and three furrows were randomly selected (Figure 7). C-Cube software provides the length measurement as a default feature by drawing the linear segment of interest on the captured frame (Figure 7 left). Epsilon E100 does not provide such a feature, so the region of interest was cropped and our length estimation algorithm was applied (Figure 7 right). The results in Table 2 suggest that, if there are no neighboring artifacts in capacitive images, such systems can calculate the length of a furrow with good accuracy.

**Figure 7.** Area of the volar forearm captured with the C-Cube spectroscope (**left**) and the Epsilon E100 capacitive imaging system (**right**). R2-4 denote the three randomly selected furrows for the comparative experiment.

**Table 2.** Results of comparative study between C-Cube spectroscope and Epsilon E100 to examine accuracy of wrinkle length estimation using GLCM. The length of three wrinkles in the area of volar forearm was compared and the correlation between the two measurement methods is calculated to 0.9.


In the final experiment, the ability of capacitive imaging sensors to measure hair water content and desorption rate is examined. For this purpose, scalp hair samples from three volunteers were washed in deionized water and dried before being left to acclimatize overnight in three different humidity chambers. Saturated salt solutions adjusted the relative humidity levels, while both temperature and relative humidity were logged every 10 s using an SHT35 sensor by Sensirion [38]. The selected salts are potassium nitrate (85% RH), sodium chloride (75% RH) and magnesium nitrate (67% RH). After acclimatization, the samples were moved to 21 °C and 35% RH conditions, side by side, and they were held against the sensor surface with a plastic plug provided by the manufacturer. The system captured video frames until the water loss rates reached a flat state or until the video exceeded 5.5 h. In order to target only the pixels in contact with hair, a range filter from 3.5 to 80 was applied to each frame. The selection of these limits is based on previous work with Epsilon E100 [39]. Five video instances from the same hair sample per acclimatization chamber are shown in Figure 8.

Two observations are made from the experimental results in Figure 9 and their summary in Table 3. First, the hair water content right after acclimatization correlates well with the relative humidity of the chamber. Second, the hair samples from younger subjects tend to hold water for a longer period of time. The latter is in agreement with Xiao et al. [40], who state that lower diffusion rates are observed in younger subjects, meaning better water holding capabilities. Note that on many occasions the sample never reached the expected baseline. In those cases, the lowest water content readout was used in the calculations for Table 3.
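The time needed for a sample to lose a given fraction of its initial water content, as summarized in Table 3, could be read off a desorption curve along the following lines. This is a hedged sketch: it assumes that the loss is measured relative to the first readout of the curve and that linear interpolation between samples is acceptable; the function name and arguments are hypothetical.

```python
import numpy as np


def time_to_fraction_lost(times, content, fraction=0.75):
    """First time at which the water content has dropped by `fraction`.

    times and content are equal-length sequences sampled along one
    desorption curve; the drop is measured relative to content[0].
    Returns None if the curve never falls that far.
    """
    t = np.asarray(times, dtype=float)
    c = np.asarray(content, dtype=float)
    target = c[0] * (1.0 - fraction)
    below = np.flatnonzero(c <= target)
    if below.size == 0:
        return None
    k = int(below[0])
    if k == 0:
        return float(t[0])
    # Linear interpolation between the last sample above the target
    # and the first sample at or below it.
    c0, c1 = c[k - 1], c[k]
    return float(t[k - 1] + (c0 - target) * (t[k] - t[k - 1]) / (c0 - c1))
```

Returning `None` for curves that never reach the target keeps such samples visible instead of silently assigning them a duration.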

**Table 3.** Results of scalp hair water loss experiment using Epsilon E100. The left side of the table shows that the water content % correlates well with the %RH in the acclimatization chamber. The right side of the table shows how long it takes for the sample to lose 75% of its initial water content.


**Figure 8.** Video snapshots from water desorption in hair samples. The first row shows five frames from hair capacitive images over time after acclimatization in 67% RH. Rows two and three show the same sample after acclimatization in the 75% and 85% RH chambers, respectively. The contrast is modified to highlight hair samples.

**Figure 9.** Hair water content desorption curves using samples from three volunteers and for three different humidity acclimatization chambers.

#### **4. Discussion**

In this study, we summarized the importance of skin and hair analysis in a variety of scientific fields, introduced and analyzed the apparatus of capacitive imaging systems using a Maxwell-based simulation, suggested algorithms for information extraction using such equipment, and conducted experiments to evaluate the overall system performance.

The simulation compared the penetration depth of the electric field between Corneometer CM825 and Epsilon E100. The results show that both instruments have satisfactory penetration depth for skin measurements, with CM825 reaching twice the measurement depth. This implies that if the stratum corneum thickness is less than 40 µm, discrepancies should be found in a side-by-side comparative study. Such results have not been found in the literature, and obtaining them in future work would help to validate our simulation. Another conclusion we draw from this simulation is that both instruments have insufficient measurement depth for hair analysis. Their electric field reaches the shaft cortex and will give some reasonable readouts, but these cannot represent the absolute hair water content.

Our experiments focused on the evaluation of capacitive imaging systems in cosmetic and skin research studies. The first experiment demonstrated how capacitive images can be used to extract skin texture information. While this could be performed with any calibrated spectroscope, expanding the applications of existing laboratory equipment is one of our goals. The suggested algorithm successfully detected the skin polygons and measured their average surface area. The results associate well with subjects' age, giving a correlation of −0.71 with <0.0006 statistical significance. Furthermore, the achieved correlation value agrees with [10,11], where the correlations between age and polygon density were calculated to be −0.64 and −0.65 in the dorsal hand and volar forearm, respectively.

The second experiment used the GLCM to estimate the length of wrinkles. In order to determine the reliability of this method, we compared our results with calibrated spectroscopy for a small group of samples. The experiment was not extended further because the need to bring the capacitive sensor into contact with the sample results in skin deformation. This is enough to distort the frame and cause repositioning algorithms to fail to identify the same wrinkle. Nevertheless, the same logic could be applied to objects with greater surface area (e.g., moles or scars) to track changes in their dimensions over time.

Our last experiment focused on measuring the water desorption rate from scalp hair samples. The experimental results showed that the measurement apparatus is capable of differentiating desorption rates between young and older subjects. This means that the comparative interpretation of the results between samples is in agreement with the literature [40], indicating that such systems can be used in hair analysis studies. Unfortunately, the observed desorption rates are lower than the ones reported in similar studies using different measurement methods (e.g., DVS or thermogravimetric) and for many of our samples the expected baseline was not reached. More specifically, Xiao et al. [40] found that it takes only 58 min for soaked hair samples to return to their baseline hydration using DVS.

To conclude, we believe that capacitive imaging sensors can be used for skin texture analysis and human skin age classification. We also believe that evidence is found for capacitive imaging application on hair water loss studies. This will require a sensor with greater penetration depth and a better sample-holding mechanism.

**Author Contributions:** Conceptualization, C.B. and P.X.; Methodology, C.B. and P.X.; Software, C.B.; Validation, C.B. and P.X.; Formal Analysis, C.B.; Investigation, C.B.; Resources, P.X.; Data Curation, C.B.; Writing—Original Draft Preparation, C.B.; Writing—Review & Editing, C.B. and P.X.; Visualization, C.B.; Supervision, P.X.; Project Administration, P.X.; Funding Acquisition, P.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article*

## **Use of Texture Feature Maps for the Refinement of Information Derived from Digital Intraoral Radiographs of Lytic and Sclerotic Lesions**

**Rafał Obuchowicz 1,\*, Karolina Nurzynska <sup>2</sup> , Barbara Obuchowicz <sup>3</sup> , Andrzej Urbanik <sup>1</sup> and Adam Piórkowski <sup>4</sup>**


Received: 17 May 2019; Accepted: 19 July 2019; Published: 24 July 2019

**Abstract:** The aim of this study was to examine whether additional digital intraoral radiography (DIR) image preprocessing based on textural description methods improves the recognition and differentiation of periapical lesions. (1) DIR image analysis protocols incorporating clustering with the k-means approach (CLU), texture features derived from co-occurrence matrices, first-order features (FOF), gray-tone difference matrices, run-length matrices (RLM), and local binary patterns, were used to transform DIR images derived from 161 input images into textural feature maps. These maps were used to determine the capacity of the DIR representation technique to yield information about the shape of a structure, its pattern, and adequate tissue contrast. The effectiveness of the textural feature maps with regard to detection of lesions was revealed by two radiologists independently with consecutive interrater agreement. (2) High sensitivity and specificity in the recognition of radiological features of lytic lesions, i.e., radiodensity, border definition, and tissue contrast, was accomplished by CLU, FOF energy, and RLM. Detection of sclerotic lesions was refined with the use of RLM. FOF texture contributed substantially to the high sensitivity of diagnosis of sclerotic lesions. (3) Specific DIR texture-based methods markedly increased the sensitivity of the DIR technique. Therefore, application of textural feature mapping constitutes a promising diagnostic tool for improving recognition of dimension and possibly internal structure of the periapical lesions.

**Keywords:** digital intraoral radiography; image preprocessing; periapical lesions; texture analysis

## **1. Introduction**

The importance of assessment of periapical lesions in clinical decision-making is well known. Osteolytic lesions form as periapical lesions in response to inflammatory infiltrates and are often associated with morbidity of the root canal pulp [1–4]. Recognition of osteolytic changes provides important information about the viability of a tooth, which influences decision-making during the treatment process. The relevant anatomical structures themselves are often small, which hinders acquisition of an adequate anatomical outline. Moreover, the relative complexity of the region is increased by the presence of superimposing structures that result in "anatomical noise". All of these factors contribute to the difficulty in recognition of bone resorption on radiographic images, which impedes the accuracy of diagnosis using digital intraoral radiography (DIR) images and may result in periapical lesions going undetected or being detected inadequately [5–7].

All of the above-mentioned factors contribute to the relatively low sensitivity of lesion detection of 70% reportedly associated with DIR images, which is markedly less than that of cone-beam computed tomography (CBCT) [8]. While visualization of lesions on CBCT is superior to that on DIR, the radiation dose the patient is exposed to via CBCT is considerably higher than that associated with conventional radiography. Therefore, use of CBCT is questionable, especially as a follow-up modality. In the current work, we present how effective DIR image analyses are with the use of image post-processing in order to refine the acquired information.

Texture feature analysis was first used to evaluate the structure of osteoporotic bone [9–11], where fractal dimension and 13 Haralick features were used for osteoporosis classification on mandibular X-ray images [9]. Similar techniques were applied to analyze periapical bone loss [12–15], where the Gray-Level Co-occurrence Matrix and Fractal Brownian Motion Model were used for bone-loss area detection [15] and localization [12]. Numerous features formed the basis for segmentation [13] and bone loss degree measurement [14]. Texture analysis was also applied to periapical bone healing [16–18], where the effectiveness of guided bone regeneration treatment was assessed radiologically. The research most similar to the present work is reported in [19,20], where the authors tried to detect the type of cyst using the Gray-Level Co-occurrence Matrix and its related properties. By using textural analysis to enhance bone representation derived from DIR images, trabecular structure may be depicted more informatively, and the shapes of different anatomical structures may be determined more accurately. Such techniques may also facilitate more precise determination of changes in the periapical bone region.

The aim of the present study was to examine the applicability of different texture analysis techniques to radiographic dental images for the refinement and possible differentiation of periapical lesions.

#### **2. Materials and Methods**

#### *2.1. Ethics Approval and Consent to Participate*

The study protocol was designed in accordance with the guidelines of the Declaration of Helsinki and the Good Clinical Practice Declaration Statement. Particular care was taken to ensure the safety of personal data, and all images were anonymized before processing. Written informed consent for the publication of clinical details and anonymized clinical images was obtained from the scientific committee and management department of the dental clinic. The usual requirement for informed consent from patients was waived in view of the retrospective nature of the research.

#### *2.2. Experiment Overview*

The experiment described in this work consisted of three main parts:


… in the preprocessing step) the standard methods for its improvement were applied in the post-processing stage.

3. Results evaluation: finally, the data were reviewed by experienced radiologists whose statements were the basis for the assessment of results.

A detailed description of these three parts of the experiment is given below and is presented in Figure 1.

*Appl. Sci.* **2019**, *9*, x FOR PEER REVIEW 3 of 14

**Figure 1.** Image processing system schema. DIR, digital intraoral radiography. ROI, region of interest.

#### *2.3. Dataset Description*

Sixty-five anonymized DIR images from patients who attended the dental clinic from 2015 to 2017 were used in the study. The images, consisting of 35 showing lytic lesions and 30 showing sclerotic lesions, were subjected to analysis. Radiographic material of patients of both sexes aged 26–57 years was used. The images were selected from the institutional picture archiving and communication system (PACS). The selection criteria were acceptable image quality and suspicion of the presence of a periapical lesion on DIR.

Periapical radiographs were obtained using a dental X-ray system (Carestream Trophy with RVG 5200, Kodak, Rochester, NY, USA). Digital images were acquired at 70 kVp and 7 mA with a mean exposure time of 0.05 s, image dimensions of 1200 × 1600 pixels, and a pixel size of 0.018 mm. Digital images were saved in 16-bit digital imaging and communications in medicine (DICOM) format in the local PACS.

#### *2.4. Texture Feature Map Computation*

The original DIR images were analyzed using OsiriX (Pixmeo) on a Mac OS-based platform. The DIR images underwent texture preprocessing in the MATLAB environment (MathWorks, Natick, MA, USA) on Windows. The clustering of image colors was implemented using the k-means approach (CLU). The co-occurrence matrices (COM) [21], first-order features (FOF), gray-tone difference matrices (GTDM) [22], run-length matrices (RLM) [23], and local binary patterns (LBP) [24,25] were applied. The details of the texture methods are described in [26].

Most of the mentioned methods (COM, FOF, GTDM, and RLM) in the original version compute several features to describe the whole image content. However, such an approach would not be useful for diagnostic purposes, so a new image (called a "texture feature map") reflecting the feature values calculated in a small region of interest was computed. As a consequence, several texture feature maps (depending on the number of features designed for each texture operator) were generated using this technique. Each texture feature map was generated using the "moving window" approach where, for each pixel, the new feature value was calculated on the basis of data collected in a square window (with sides of an odd number of pixels in length, to ensure that the considered pixel is in the center). In the current study, a 21 × 21 pixel square was used. This size is a compromise between computational overhead, i.e., image processing time (which grows rapidly with the window size), statistical stability of the results (441 elements used to fill a histogram of 256 bins are sufficient to achieve statistically reliable results), and image quality.

For the LBP texture operator, the radius, R, was in a range from 3 to 15 pixels and 8 samples were taken in the circular neighborhood in the described experiments. In the case of clustering, from 10 up to 50 clusters were considered. The RLM calculated the matrix for images quantized to 32 and 64 colors, while the matrix stored the maximal length of 10 elements in a run. The square neighborhood applied to calculate the GTDM matrix was evaluated for sides 3, 7, and 11, yet the smallest one returned the best result. Some examples of texture feature maps achieved with the techniques described here are presented in Figure 2.
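The moving-window computation can be sketched in Python with NumPy (the study itself used MATLAB, so this is an illustration rather than the authors' code). The example computes a first-order "energy" map, taken here as the sum of squared histogram probabilities inside each window; that definition, the reflective border padding, and the function name `fof_energy_map` are assumptions of this sketch.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def fof_energy_map(image, window=21, bins=256):
    """Moving-window texture feature map of first-order energy.

    For every pixel, a histogram of the window centred on that pixel is
    built and the feature value sum(p**2) is stored at the pixel position.
    Borders are handled by reflective padding (an assumption of this sketch).
    """
    if window % 2 == 0:
        raise ValueError("window side must be odd so that the pixel is centred")
    img = np.asarray(image, dtype=np.uint8)
    pad = window // 2
    padded = np.pad(img, pad, mode="reflect")

    # windows[r, c] is the (window x window) neighbourhood of pixel (r, c).
    windows = sliding_window_view(padded, (window, window))
    out = np.empty(img.shape, dtype=float)
    n = window * window
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            hist = np.bincount(windows[r, c].ravel(), minlength=bins)
            p = hist / n
            out[r, c] = np.sum(p * p)
    return out
```

The explicit double loop keeps the per-pixel logic visible. With a 21 × 21 window each output pixel reads 441 input values, which is why processing a full 1200 × 1600 image this way is slow and why the window size matters for computation time.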

#### *2.5. Pre- and Post-Processing of Images*

The DICOM images store shade information using 12 bits, the aim of which is to save as much detailed information in the scanned data as possible, while most algorithms for image processing are designed to work with gray-scale images coded on 8 bits. Therefore, in order to process the data, the depth of the color was reduced, and the images were converted to 8-bit color coding. This operation makes it possible to compute the texture feature maps with the standard approach to texture processing and has been proven to remove some of the noise [27].

When an image is of low quality, particularly when it lacks sharpness and contrast, the histogram equalization operation can be applied to ensure that the whole color range is used, thereby rendering objects more easily visible, as shown in Figure 2b. However, this approach is prone to failure when used on images that contain both very dark and very light objects, as is often the case with plain radiographic images. Hence, there is a need for more sophisticated methods to depict the data in a more informative manner. Nevertheless, application of the histogram equalization (HEQ) method in the preprocessing stage (i.e., before the texture feature map is computed) was tested and, when followed by the RLM technique, proved to be useful because the texture feature map quality improved significantly.

**Figure 2.** Various aspects of image preprocessing. (**a**) Original data. (**b**) The same image after histogram equalization shows improved contrast but does not depict the structure clearly in the tooth region. (**c**–**f**) Examples of feature maps derived from a digital intraoral radiographic image via various texture analysis methods. CLU, clustering with k-means approach; DIR, digital intraoral radiography; FOF, first-order features; GTDM, gray-tone difference matrices; LBP, local binary patterns.

On the other hand, the contrast of some textural feature maps was low and did not present the content clearly. For those, histogram stretching was applied to the final result; it scales the pixel values so that the full range of 8-bit color coding (0–255) is used. This transformation does not change the image content but makes it easier for a radiologist to evaluate. There were also some cases where application of histogram equalization gave a better effect.
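The two intensity transformations described in this section, bit-depth reduction before feature computation and histogram stretching afterwards, can be sketched as follows. The text does not state how the 12-bit data were mapped to 8 bits, so the right shift used here is an assumption, as are the function names.

```python
import numpy as np


def to_8bit(image, stored_bits=12):
    """Reduce an image with `stored_bits` of grey-level depth to 8 bits.

    The mapping is a plain right shift that drops the least significant
    bits (an assumption; other mappings such as windowing are possible).
    """
    img = np.asarray(image, dtype=np.uint16)
    return (img >> (stored_bits - 8)).astype(np.uint8)


def stretch_histogram(feature_map):
    """Linearly rescale a feature map so that it spans the range 0-255."""
    fm = np.asarray(feature_map, dtype=float)
    lo, hi = fm.min(), fm.max()
    if hi == lo:
        # A constant map carries no contrast to stretch.
        return np.zeros(fm.shape, dtype=np.uint8)
    return np.round((fm - lo) * 255.0 / (hi - lo)).astype(np.uint8)
```

Because the stretching is a monotonic linear rescaling, it preserves the ordering of pixel values, which matches the statement that it does not change the image content.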

#### *2.6. Experiment Methodology*

The native DICOM images and the texture feature maps obtained were analyzed on a 4K retina monitor by two radiologists with 10 and 30 years of experience in the analysis of classical bone radiograms, including dental ones. Standard DIR images were assessed first. The pictures were then examined separately using the techniques described above, i.e., CLU, COM, FOF, GTDM, RLM, and LBP. The prepared feature images were evaluated for radiodensity, border definition, and tissue contrast. Figure 3 presents the analyzed regions. The aforementioned parameters were estimated separately to evaluate the presumed improvement of visualization and the subsequent increase in accuracy of detection of periapical lesions. Radiodensity analysis was used to evaluate bone density changes, border definition described how clearly the edges of the changes were delineated, and tissue contrast (gray-scale contrast) was used for evaluation of the lesion character. All features were summarized in the evaluation chart. Results were encoded in 1/0 code, where 0 meant lack of recognition of the feature and 1 meant its visual confirmation. In order to comprehensively evaluate the usability of the texture feature maps, two main clinical issues were analyzed: sclerotic lesions and lytic lesions. Interrater reliability was at the level of 98% (on the basis of concordance correlation), and doubtful cases were resolved by interrater consensus.

**Figure 3.** Example of assessment of a periapical lesion. (**a**) Entry digital intraoral radiographic image. (**b**) Differentiation of radiodensity in the lesion (different shapes are pointed out by the arrows). (**c**) Border definition (pointed out by small arrows). (**d**) Tissue contrast between lesion and the neighborhood (shown by the arrows).

#### **3. Results**

Figure 4 shows the performance of each image processing approach with respect to radiological changes. The original DIR images are presented in the top row, next to the same images after histogram equalization transformation. As shown, despite the images gaining better contrast, the visibility of changes did not improve substantially. In the next row in Figure 4, the region where changes existed is enclosed within a red line. In the following rows, the texture feature maps computed for CLU, FOF energy, GTDM busyness, (HEQ) RLM short run high gray level run emphasis, and LBP are given. Recognition of the borders of the lesions and their internal structure was markedly improved in comparison with the initial DIR image. In the FOF group, the contours of lytic changes were represented effectively. LBP and CLU revealed previously hidden information about the internal structure of the lesions and tissue contrast. (HEQ) RLM was found to improve visualization of lytic and sclerotic lesions markedly.
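The histogram equalization step mentioned above is not specified further in the text. As a minimal sketch, assuming the classic global method applied to an 8-bit gray-level image stored as a list of rows (the function name `equalize_histogram` and the tiny test image are illustrative, not taken from the study), it could look like this:

```python
def equalize_histogram(image, levels=256):
    """Return a globally histogram-equalized copy of a 2D list of gray levels.

    Assumption: pixel values are integers in the range [0, levels - 1].
    """
    pixels = [p for row in image for p in row]

    # Histogram of gray levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution function (as raw counts).
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:
        # Constant image: there is no contrast to stretch.
        return [row[:] for row in image]

    # Look-up table mapping each input level to its equalized level.
    scale = (levels - 1) / (total - cdf_min)
    lut = [round((c - cdf_min) * scale) if c >= cdf_min else 0 for c in cdf]
    return [[lut[p] for p in row] for row in image]


# Hypothetical low-contrast patch: four neighboring gray levels.
patch = [[100, 101], [102, 103]]
print(equalize_histogram(patch))
```

The look-up table spreads the occupied gray levels over the full range, which raises global contrast but, as the results above indicate, does not by itself expose local texture differences.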

The delineation of sclerotic lesions and internal pattern recognition were achieved with CLU, RLM, and LBP texture feature maps.

The potential utility of each method was calculated based on data derived from the experts and is expressed as sensitivity and specificity for the different groups of texture feature maps, gathered in the bar plots presented in Figure 5. Samples presenting lesions and marked as such by the experts were counted as true positive (TP) cases. When the data presented no changes and the expert marked none, a true negative (TN) result was recorded. When the expert noticed a change in DIR data without lesions, a false positive (FP) result was recorded. False negative (FN) results corresponded to the situation in which the data presented changes but the expert did not notice them. Consequently, the formulas for sensitivity and specificity are as follows:

$$Sensitivity = \frac{TP}{TP + FN},$$

$$Specificity = \frac{TN}{TN + FP}.$$
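The two formulas can be checked with a short sketch. It assumes that the reference status and the rater's decision are both available as 1/0 lists, in line with the 1/0 coding described in the methods; the function names and the example labels are hypothetical.

```python
def confusion_counts(truth, rating):
    """Count (TP, TN, FP, FN) from paired 1/0 labels.

    truth[i]  == 1: the feature is actually present in sample i.
    rating[i] == 1: the rater visually confirmed the feature in sample i.
    """
    pairs = list(zip(truth, rating))
    tp = sum(1 for t, r in pairs if t == 1 and r == 1)
    tn = sum(1 for t, r in pairs if t == 0 and r == 0)
    fp = sum(1 for t, r in pairs if t == 0 and r == 1)
    fn = sum(1 for t, r in pairs if t == 1 and r == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Fraction of truly positive samples that were recognized."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of truly negative samples that were correctly left unmarked."""
    return tn / (tn + fp)


# Hypothetical labels for eight samples (not data from the study).
truth = [1, 1, 1, 1, 0, 0, 0, 0]
rating = [1, 1, 1, 0, 0, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(truth, rating)
print(sensitivity(tp, fn), specificity(tn, fp))
```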

A very low percentage value meant that the transformation did not improve visibility of a lesion, while a high percentage value indicated improved visibility and a marked increase in potential lesion recognition after texture map utilization in comparison to DIR. Performance of each image processing approach with respect to radiological changes is presented on separate graphs in Figure 5, illustrating the sensitivity and specificity of the proposed methods. Recognition and differentiation of the lytic lesions after use of the texture feature maps showed the highest sensitivity for the (HEQ) RLM texture feature map, with scores of 94%, 89%, and 94% for radiodensity, border definition, and tissue contrast, respectively; specificity for recognition of these parameters was 86%, 89%, and 43%. The next best performing texture feature map was CLU, with a sensitivity of 83%, 77%, and 80%, and a specificity of 74%, 97%, and 51% for radiodensity, border definition, and tissue contrast, respectively. FOF texture feature maps showed relatively low sensitivity for lytic lesions (60%, 69%, and 51%) but high specificity (94%, 91%, and 69%) for recognition of the three chosen radiological features.

For the sclerotic lesions, the (HEQ) RLM texture feature map was again found to have the best performance, with a sensitivity of 97%, 80%, and 97% for recognition of radiodensity, border definition, and tissue contrast, respectively. The specificity of this texture feature map was lower for recognition of radiodensity changes (47%) but better for border definition (90%) and tissue contrast differentiation (53%). FOF texture feature maps performed well in detection of sclerotic lesions, with recognition of radiodensity, border definition, and tissue contrast in 73%, 70%, and 57% of cases, respectively, with high specificity for the chosen features of 60%, 83%, and 73%. No important refinement of recognition of sclerotic lesions was observed for CLU texture feature maps, which had low sensitivity values of 60%, 60%, and 3% for the three radiological features.

The highest sensitivity for detection of sclerotic lesions was shown for the (HEQ) RLM texture feature maps in terms of radiodensity differentiation, border definition, and tissue contrast recognition, but its specificity was not higher than that of the CLU and FOF texture feature maps. FOF texture feature maps showed good sensitivity for detection of sclerotic lesions and had better specificity than the CLU and RLM texture feature maps.

The best performance in terms of recognition of the features analyzed in both sclerotic and lytic lesions was achieved for the (HEQ) RLM texture feature map, although the FOF texture feature map showed acceptable specificity for recognition of these parameters. CLU was also a well-performing texture feature map for detection of lytic changes, with high sensitivity but lower specificity in comparison with the FOF texture feature map (see the comparison of parameters in Figure 6).

**Figure 4.** Example of assessment of a periapical lesion.

**Figure 5.** Sensitivity and specificity of different texture feature maps for detection of lytic (**a**,**b**) and sclerotic (**c**,**d**) lesions. CLU, clustering with k-means approach; FOF, first-order features; GTDM, gray-tone difference matrices; HEQ, histogram equalization; LBP, local binary patterns; RLM, run-length matrices.

**Figure 6.** Relationship between specificity and sensitivity for (**a**) lytic lesions and (**b**) sclerotic lesions. Generally, the higher both values are, the better the parameter. CLU, clustering with k-means approach; FOF, first-order features; GTDM, gray-tone difference matrices; HEQ, histogram equalization; LBP, local binary patterns; RLM, run-length matrices.

#### **4. Discussion**

During image analysis, the observer first notices whole objects, anything that has delineated edges, and anything that is in contrast with the surrounding area. However, when the image quality is very low, the content is blurred, or the image has no contrast, an object's texture may appear very similar to that of the background. In such cases, only analysis of local differences that are hard to discern on standard radiography can unmask some of the vital information contained in the image. Although it may be impossible to detect differences by assessment of plain radiographic images, transforming visual data derived from an image with algorithms designed to map structural differences may reveal hidden content. The process used in our study is presented in Figure 2 in a sequence of images depicting how a substantial amount of additional information about a lesion was gained in comparison to the initial DIR image.

The method of using texture feature maps obtained by texture analysis of the DIR images in the current study is consistent with that used in some previous studies [19,20]. Systems for differentiating cysts, ameloblastomas, and keratocysts on DIR images are described in those reports. The general approach consists of image preprocessing (e.g., opening, contrast stretching), obtaining similarity measures, and texture analysis. Such methods are applicable to a wide range of problems, e.g., identification of macerals [28] and defect detection [29]. Moreover, the MaZda system [30] implements some of the described texture feature maps and returns results similar to those of our implementation.

In the current study, a much broader set of texture features based on different approaches to image analysis was utilized. FOF feature maps yielded significant improvements in delineation of lytic changes in comparison with DIR. GTDM enhanced visualization of the internal structure within a lytic area and important details of adjacent trabeculation with preserved lesion contours. LBP yielded a surface scene with a clearly differentiated surface pattern at the site of the lytic area. CLU increased tissue contrast in areas of lytic changes. Importantly, (HEQ) RLM increased differentiation of the border and contrast in lytic lesions but did not perform so well for sclerotic changes. The performance of the different texture feature maps, expressed in terms of sensitivity, specificity, F1 score, and accuracy, is summarized in Table 1 (for recognition of lytic lesions) and Table 2 (for recognition of sclerotic lesions).
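The text gives formulas for sensitivity and specificity only. Assuming the standard definitions of the two additional measures reported in the tables, in terms of the same TP, TN, FP, and FN counts, they would read:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN},$$

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}.$$

Under these definitions the F1 score does not depend on TN, so it can differ noticeably from accuracy when cases without the feature dominate the sample.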

**Table 1.** Sensitivity, specificity, F1 score, and accuracy in the differentiation of the different diagnostic parameters of the lytic lesions after use of texture feature maps.

**Table 2.** Sensitivity, specificity, F1 score, and accuracy in the differentiation of the different diagnostic parameters of the sclerotic lesions after use of texture feature maps.


Few studies of texture analysis have evaluated periapical changes. Possible differentiation of lytic lesions into granulomas and periapical cysts on the basis of radiograms, with use of radiometric analysis by histogram calculation and histogram equalization, was proposed by Shrout and White, respectively [31,32]. In another study, after some classic image processing methods (top hat, erosion, and opening) were performed, the skeleton was extracted, and textural features were calculated for the region of interest. A pair of pre-treatment and post-treatment values was then tested in an evaluation of the healing process [33]. The scope of such studies has typically been the detection of areas of alveolar bone on periapical dental radiographs [12–15]. These considerations were focused on segmentation, detection of lesions, and measuring the degree of alveolar bone loss rather than a textural analysis in cases requiring differentiation of lesions. A similar approach has been reported for the analysis of panoramic images in order to enhance the recognition of caries [34]. Another study assessed the treatment effectiveness of guided bone regeneration in cases of post-resectal and post-cystal bone loss on DIR images obtained using RVG 6100 digital radiography equipment (Kodak) [16]. Fractal dimension measurements (power spectral density, triangular prism surface area, blanket, intensity difference scaling, and variogram methods) were performed, and the images became smoother during the healing process after bone loss [35].

Despite the progress to date in the quality of plain radiographic images, DIR still has a number of limitations. In addition to "anatomical noise" and small differences in bone density, there are other phenomena that impede image quality and hinder the anatomical recognition of potentially pathological structures [36,37]. Current technical developments in informatics hardware have made it possible to perform very complicated calculations in a relatively short time at low cost.

The approach used in the current study provides significantly more radiological information than standard DIR images. This comprehensive study is the first in which the usability of such a broad set of approaches to image texture analysis has been presented. Based on our findings, from a radiological perspective, we consider these techniques a step forward in the recognition and precise localization of periapical cystic lesions. A limitation of this retrospective study is the lack of histological verification of the lesions analyzed; however, comparison of texture feature map analysis with histological results will be the subject of an upcoming study by our team. The strengths of this study include evaluation of the most popular textures known to date (mathematical transformations of image analyses applicable to DICOM-based high-resolution DIR images). Future studies should investigate the development of new image processing algorithms based on the current study, and correlations with histopathological specimens in order to evaluate their ability to predict different histologically depicted lesions on the basis of image texture.

#### **5. Conclusions**

The RLM texture feature map significantly improves recognition of lytic and sclerotic lesions, albeit with lower specificity for sclerotic lesions, in comparison to DIR images. CLU, in comparison to the DIR images, markedly increases visualization of lytic lesions with high sensitivity and specificity but is less able to detect the radiological features associated with sclerotic changes. FOF texture feature maps significantly improve detection of the radiological features of both sclerotic and lytic lesions, compared to DIR, with good sensitivity and specificity.

**Author Contributions:** Conceptualization, R.O., K.N. and A.P.; Data curation, R.O. and B.O.; Formal analysis, K.N.; Investigation, R.O., K.N. and A.P.; Methodology, K.N.; Project administration, A.P.; Resources, B.O.; Software, K.N.; Supervision, R.O.; Validation, R.O., B.O. and A.U.; Writing—original draft, R.O., K.N. and A.P.; Writing—review & editing, R.O., K.N., A.U. and A.P.

**Funding:** This publication was funded by AGH University of Science and Technology, Faculty of Electrical Engineering, Automatics, Computer Science and Biomedical Engineering.

**Acknowledgments:** The authors would like to thank B. Wawrzykowska, head of the Denta-Med Kraków Clinic.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Machine Vision System for Counting Small Metal Parts in Electro-Deposition Industry**

## **Rocco Furferi \*, Lapo Governi, Luca Puggelli, Michaela Servi and Yary Volpe**

Department of Industrial Engineering, University of Florence, 50134 Firenze, Italy; lapo.governi@unifi.it (L.G.); luca.puggelli@unifi.it (L.P.); michaela.servi@unifi.it (M.S.); yary.volpe@unifi.it (Y.V.)

**\*** Correspondence: rocco.furferi@unifi.it; Tel.: +39-055-2758741

Received: 20 May 2019; Accepted: 13 June 2019; Published: 13 June 2019

**Featured Application: The present work has application in the field of galvanic coating for the fashion industry by proposing a method and a machine able to count the number of items attached to a galvanic frame.**

**Abstract:** In the fashion field, the use of electroplated small metal parts such as studs, clips and buckles is widespread. The plate is often made of precious metal, such as gold or platinum. Due to the high cost of these materials, it is of primary importance for manufacturers to avoid any waste by depositing only the strictly necessary amount of material. To this aim, companies need to know the overall number of items to be electroplated so that the parameters driving the galvanic process can be set properly. Accordingly, the present paper describes a simple, yet effective machine vision-based method able to automatically count small metal parts arranged on a galvanic frame. The devised method relies on the definition of a rear projection-based acquisition system and on the development of image processing-based routines. The system is implemented on a counting machine, which is meant to be adopted in the galvanic industrial practice to define a suitable set of working parameters (such as the current, voltage, and deposition time) for the electroplating machine and, thereby, assure the desired plate thickness on the one hand and avoid material waste on the other.

**Keywords:** Machine vision; image analysis; item counting device; electro-deposition industry

### **1. Introduction**

As widely recognized, electroplating (more precisely electrodeposition) is a chemical process that uses electric current to transfer metal from a cation to an electrode (i.e., the object to be treated), to form a coherent thin metal coating [1]. The amount of mass deposit derives from Faraday's laws on electrolysis [2] and directly depends on current intensity and time:

$$m = \frac{M \cdot I \cdot t}{Z \cdot F} \tag{1}$$

where:

m = mass deposited on the electrode [g];

M = molar mass of the material to be deposited [g/mol];

I = current intensity [A];

t = time [s];

Z = valence of the material's ions;

F = Faraday's constant (96485.33 C mol<sup>−1</sup>).
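As a quick numerical illustration of Equation (1), the following minimal sketch evaluates the deposited mass for one set of process values. The function name and the example values (gold with an assumed ion valence of 1, a current of 2 A, and 600 s of deposition) are illustrative assumptions, not data from this work.

```python
FARADAY_CONSTANT = 96485.33  # C/mol, the value quoted in the text


def deposited_mass(molar_mass, current, time_s, valence):
    """Equation (1): m = M * I * t / (Z * F), with m in grams.

    molar_mass in g/mol, current in A, time_s in s, valence dimensionless.
    """
    return molar_mass * current * time_s / (valence * FARADAY_CONSTANT)


# Illustrative values only: gold (M of about 196.97 g/mol), assumed Z = 1.
mass = deposited_mass(molar_mass=196.97, current=2.0, time_s=600.0, valence=1)
print(f"deposited mass: {mass:.3f} g")
```

The mass scales linearly with both current and time, so halving the current requires doubling the deposition time to reach the same deposit.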

The process used by manufacturers working in the electrodeposition of fashion accessories consists of arranging the (usually) small parts to be plated on a frame by using hooks or, more often, metal wires, as shown in Figure 1.

**Figure 1.** Typical galvanic frame.

Therefore, electrodeposition occurs simultaneously on a number of parts. Since the material to be deposited on the electrode (multiple items to be plated) is required to form a uniform thin layer, the overall mass is given by:

$$m = n \cdot s \cdot T \cdot \rho \tag{2}$$

where:

n = number of items to be electroplated;

s = surface of a single item;

T = coating thickness;

ρ = material mass density.

Hence, in order to obtain the desired coating thickness, both the number and surface of the items to be plated need to be known. While an item's surface can be retrieved from the items' Computer Aided Design (CAD) models, if available, or by using 3D scanning, the number of items arranged on the galvanic frame, which is needed to compute the overall surface to be electroplated, is not straightforwardly available.

To date, the parts attached to the frame are counted manually; the reliability of this process is limited, as it is inevitably prone to errors due to the operators' tiredness and lack of attention.

In scientific literature, several papers specifically address the topic of designing counting systems with reference to a variety of industrial fields [3,4]. In addition, many counting machines have been available on the market for years [5,6]. Unfortunately, regardless of the technology adopted (e.g., weight measurement, free-fall, optical scan lines), almost all the machines available on the market require items to be physically separated from each other (i.e., not arranged on packages or frames), or to be arranged upon a moving tray. Therefore, such solutions are not suitable or adaptable to count items that are already arranged on a galvanic frame. Fortunately, machine vision (MV) systems have the potential to solve this issue by combining optical devices and proper image processing algorithms to determine the overall number of objects captured in a scene. Indeed, a considerable number of MV systems have been proposed in the scientific literature to address the object counting issue [7]. However, the main difficulty in devising a system for counting metal parts attached to a galvanic frame is the high reflectivity of the items themselves, which presents a considerable challenge for any kind of optical acquisition system. For this reason, to the best of the authors' knowledge, no automatic counting system has been devised so far for the galvanic industry.

Accordingly, the present paper proposes a machine vision-based method to automatically count small metal parts arranged on a galvanic frame. The devised method, relying on the definition of a proper acquisition system and on the development of image processing-based routines, is implemented on a counting machine to be adopted in the galvanic industrial practice. The machine architecture is designed to discard the undesired reflections due to the metal surface so as to properly separate all attached items from the background. This allows a set of simple, yet effective image processing algorithms to correctly determine the number of items to be coated in the galvanic bath. Finally, the knowledge of the number of items will allow companies to define a suitable set of working parameters (such as the current, voltage and deposition time) for the electroplating machine and, thereby, assure the desired plate thickness on the one hand and avoid material waste on the other.
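To make the link between the item count and the working parameters concrete, the sketch below combines Equation (2) with Equation (1) solved for t. It is a minimal illustration under stated assumptions: all of the current is assumed to contribute to deposition, units are chosen as cm², cm, and g/cm³ so that the mass comes out in grams, and the function names and numerical values (gold, 120 items of 4 cm² each, a 1 µm coating, 2 A) are hypothetical.

```python
FARADAY_CONSTANT = 96485.33  # C/mol, the value quoted in the text


def coating_mass(n_items, item_surface_cm2, thickness_cm, density_g_cm3):
    """Equation (2): m = n * s * T * rho, with m in grams."""
    return n_items * item_surface_cm2 * thickness_cm * density_g_cm3


def deposition_time(mass_g, molar_mass, current, valence):
    """Equation (1) solved for t: t = m * Z * F / (M * I), in seconds."""
    return mass_g * valence * FARADAY_CONSTANT / (molar_mass * current)


# Hypothetical frame: 120 items, 4 cm^2 each, 1 micrometre (1e-4 cm) of gold
# (density of about 19.3 g/cm^3, M of about 196.97 g/mol, assumed Z = 1).
mass = coating_mass(120, 4.0, 1e-4, 19.3)
time_s = deposition_time(mass, molar_mass=196.97, current=2.0, valence=1)
print(f"mass to deposit: {mass:.4f} g, deposition time: {time_s:.0f} s")

# A miscount propagates linearly: 5% too many counted items means 5% more
# deposition time, and therefore 5% more precious metal than required.
overcount_time = deposition_time(coating_mass(126, 4.0, 1e-4, 19.3), 196.97, 2.0, 1)
print(f"relative excess with 126 counted items: {overcount_time / time_s - 1:.2%}")
```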

#### **2. Materials and Methods**

As shown in Figure 1, the galvanic frame is formed by 4 tubular beams welded to compose a rectangle. In the general configuration, several hooks are joined to the shorter sides. The workers use inert metal wires to knot together a variable number of items. Subsequently, each wire is linked to a pair of corresponding hooks (i.e., the nth on the upper side with the nth on the lower) so that the wire runs along the frame in an approximately vertical direction. Once all the pairs of hooks are filled, the galvanic frame is sent to the electroplating bath. Considering that items can be very small in size (down to 10 mm on the shorter side), the variability in length (and thus in mass) of the wire itself precludes the adoption of any weight-based approach for the counting system. Moreover, two consecutive items can be attached at a relative distance as small as 20 mm.

Consequently, attention has been focused on computer vision-based approaches. The main idea is to properly acquire a 2D digital image of the frame, on which to detect each item by means of computer vision (CV) tools [7].

#### *2.1. Literature Methods*

According to the scientific literature, several different CV approaches can be adopted. Considering the task, three of them seem to deserve further investigation: deterministic template matching, neural network-based algorithms, and brightness-based segmentation. The applicability, effectiveness, and robustness of each of them strictly depend on the type of image to be analyzed. It should be noted that other approaches, such as color-based ones, are not applicable since the wire color can be very close to that of the items.

In more detail, deterministic template-matching algorithms are intended to find, within an image, instances of a given template. For example, OCR (optical character recognition) procedures, which recognize text within pictures (e.g., a PDF file), are usually built on template-matching algorithms. This approach would be optimal for our problem if only the items were arranged on a rigid grid, so that each item was oriented in the same way with respect to the camera. In fact, some companies use galvanic frames where items are placed in fixed positions on the frame, as shown in Figure 2.

*Appl. Sci.* **2019**, *9*, x 4 of 14


**Figure 2.** Galvanic frame with items placed in a fixed position.

In such cases, however, the operators usually fill all the available slots with items; therefore, the number of items on the galvanic frame is known a priori. As already mentioned, in the general configuration described previously, the items are knotted on wires and, consequently, their orientation in space is far from uniform. This issue inevitably limits the applicability of deterministic template-based algorithms and makes their adoption impractical for the specific case analyzed in this paper.
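To illustrate why orientation matters, the following minimal sketch runs an exhaustive template match on a toy binary image. It is not the procedure used in this paper: the function name `match_template`, the sum-of-squared-differences score, and the toy data are assumptions made purely for illustration.

```python
def match_template(image, template, max_ssd=0):
    """Return the top-left positions where the sum of squared differences
    between the template and the image window is at most max_ssd."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    hits = []
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            ssd = sum((image[r + i][c + j] - template[i][j]) ** 2
                      for i in range(th) for j in range(tw))
            if ssd <= max_ssd:
                hits.append((r, c))
    return hits


# Hypothetical L-shaped item used as the template.
template = [
    [1, 0],
    [1, 0],
    [1, 1],
]

# Toy scene with two items: one in the template orientation (left)
# and one rotated (right).
image = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

print(match_template(image, template))  # [(1, 1)]
```

Only the item that shares the template orientation is found; the rotated one is missed, which is the limitation described above for items knotted on wires.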

With respect to this limitation, an evolution of the template-based algorithms described above can be found in neural network (NN)-based approaches [8–11]. Some of them, in fact, are able to detect a specific object independently of its orientation and position in the scene. Among them, YOLO (you only look once) [12] is a state-of-the-art object detection system targeted at real-time processing. Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (humans, cars, etc.) in digital images and videos.

Unlike prior approaches, which apply the model to an image at multiple locations and scales and then consider the high-scoring regions of the image as detections, YOLO applies the network to the full image. Specifically, the image is divided into an S × S grid and the algorithm returns bounding boxes and predicted probabilities for each of these regions. The method used to compute these probabilities is logistic regression [13]. This way, besides performing a very fast detection, predictions are informed by the global context in the image.

Off-the-shelf YOLO nets are available with pre-trained weights, but they are not able to give predictions on our subject of interest, as they have not been trained to detect these particular objects (see Figure 3).

**Figure 3.** Prediction result obtained with an off-the-shelf pre-trained YOLO (you only look once) net.

On the other hand, to achieve a proper result using these networks, it is not sufficient to provide a limited number of training images (e.g., ten to twenty images). In the light of these considerations, and given that the shape of the items can change frequently (each month, all lots can be completely new), it is not practical for users to train the algorithm each time.

For this reason, another approach has been explored: The classical brightness-based segmentation [14–16]. Assuming that the item brightness (or range of brightness) is different from that of the background, it is possible to isolate the background itself. The resulting image contains only pixels belonging to the items and connecting wires. Unfortunately, wire and item colors can be so close that color segmentation cannot be used to separate the one from the other. Fortunately, since wires are thinner than items, their pixels can be removed by means of CV tools such as pixel erosion/dilation. The number of separate clusters of pixels describing the items can then be easily retrieved by means of labeling tools.

In detail, a threshold value (or at least a range) must be used in order to isolate on the image only pixels relative to the items to be counted. Supposing that the threshold operation works flawlessly, a binary image can be obtained, where white pixels represent items to be counted and black pixels are the background and the wires. Afterward, many well-known algorithms can be used to count the number of isolated regions in binary images.
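The thresholding and counting steps described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the helper names `threshold` and `count_regions`, the fixed threshold of 128, the choice of 8-connectivity, and the toy gray-level values are all assumptions.

```python
from collections import deque


def threshold(gray, t):
    """Binary image: 1 where the gray level exceeds t, 0 elsewhere."""
    return [[1 if v > t else 0 for v in row] for row in gray]


def count_regions(binary):
    """Count the 8-connected regions of non-zero pixels (breadth-first labeling)."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny in range(max(0, y - 1), min(rows, y + 2)):
                    for nx in range(max(0, x - 1), min(cols, x + 2)):
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Toy image: two bright items on a dark background (values are made up).
gray = [
    [10, 10, 10, 10, 10, 10],
    [10, 200, 210, 10, 10, 10],
    [10, 220, 10, 10, 180, 10],
    [10, 10, 10, 10, 190, 10],
]

binary = threshold(gray, 128)
print(count_regions(binary))  # 2
```

The count is only as reliable as the binary image: any wire pixel that survives thresholding, or any item pixel lost to it, changes the number of regions.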

#### *2.2. Image Acquisition Requirements*

The brightness-based segmentation approach seems the most promising for the specific application but needs to be tailored to the peculiarities entailed by the small dimensions and high reflectivity of the items to be counted. Consequently, the definition of a proper input image has primary importance for the success of the method.

Depending on the finish and on the material of which the items are made, their appearance is rarely opaque; rather, it is highly reflective. Moreover, the color itself may change, varying from copper-like to silver and gold, thus resulting in different brightness. All these characteristics make it very difficult (or even impossible) to obtain satisfactory threshold values or ranges on which to base the segmentation.

To complicate matters further, the silhouette of the same item knotted to the galvanic frame varies significantly, due to the almost-random orientation. Moreover, the items are far from being equally spaced.

Therefore, in order to make the segmentation algorithms effective, it is of critical importance to obtain a suitable image, in which the items can be separated from the background. To this purpose, three different lighting settings have been considered in order to evaluate their efficacy in favoring image segmentation operations:

- Frontal light with black uniform background;
- Lighted background;
- Rear projection.


#### *2.3. Light Settings*

In industrial practice, a single galvanic frame is filled with a number of identical items. In order to make the tests representative and speed up the testing process, we chose to use typical galvanic frames filled with a variety of items of different shapes and dimensions arranged on vertical metal wires, instead of using multiple frames (each one with a single item typology). Image acquisition was carried out using a Fujifilm T1 SLR camera (APS-C sensor format, Fujifilm Holdings Corporation, Tokyo, Japan) and an 18 mm focal length lens. The acquired images had a resolution of 15.8 megapixels (4826 × 3264 pixels).

#### 2.3.1. Frontal Light with Black Uniform Background

The first tested layout is meant to physically isolate the items from the rest of the scene by placing an opaque black canvas behind the frame. The frame, containing four different item typologies, is positioned approximately perpendicular to the camera optical axis at a distance of 500 mm, so that the frame completely fills the field of view. The set (see Figure 4) is illuminated by a frontal lighting source (800 lm focusable LED torch, Essilor International S.A., Paris, France). This layout provides an almost-uniform black background against which the item shapes stand out.

**Figure 4.** Frontal light with black uniform background setup.

In the resulting digital image, background pixels are characterized by low brightness. In contrast, the item pixels have, as expected, high brightness values due to the strong frontal illumination (see, for instance, Figure 5a, which shows four different kinds of items). However, this layout is not optimal for a number of reasons. First, background subtraction (i.e., subtracting from the image to be analyzed a reference image of the background canvas acquired prior to positioning the galvanic frame) is not applicable, since the shadows/reflections projected on the canvas by the items and the wires make the background itself different from the reference.

In addition, brightness-based segmentation leads to two further main issues:

- Item and background pixels may be incorrectly detected/assigned;
- The wire and item brightness are similar and thus difficult to separate.

**Figure 5.** (**a**) Acquired image using the first setup; (**b**) image after thresholding; (**c**) image after morphological opening.


Starting from Figure 5a, it was not possible to isolate the items from the wires by thresholding, as shown in Figure 5b, since the resulting binary image also contained the wires. The only filtering operation that could allow wire deletion is image erosion. Unfortunately, since the item dimensions and wire thicknesses are similar, the operation (even if combined with a subsequent dilation, thus performing a morphological opening) led to sub-fragmentation of single items into multiple pixel clusters (see Figure 5c), thus invalidating the subsequent counting operation.

Observing in detail the image of two items (named item A and item B, respectively) right after thresholding (Figure 6a,c), the first issue became evident. It was noted that, for some items, the darkest pixels were mistakenly assigned to the background. Consequently, some of the items were already fragmented into multiple parts (see Figure 6b,d). In other cases, the effects were less evident but equally harmful, due to the subsequent (required) erosion operation. As shown in Figure 6c, the items may be so thinned that subsequent operations unavoidably cause fragmentation.

**Figure 6.** (**a**) Item A, fragmented after thresholding; (**b**) item A, after morphological opening; (**c**) item B, thinned after thresholding; (**d**) item B, fragmented after morphological opening.

In more detail, sub-fragmentation can occur in two possible scenarios: in the presence of bridges (Figure 6a) or in the case of inner holes (Figure 6c). In the first case, the thickness of the bridge may be similar to the thickness of the wires to be removed. Consequently, a common occurrence was that the morphological opening on the binary image (i.e., erosion followed by dilation) removed both the wires and the bridges, thus causing undesired fragmentation of the cluster (Figure 6d). Similarly, in the case of inner holes (given by the actual shape of the item or caused by thresholding), morphological opening may cause fragmentation.

This issue can possibly be avoided by using a morphological closing (i.e., dilation followed by erosion) followed by an additional erosion. Figure 7a shows the result of such an operation applied to Figure 6c. However, this solution may also lead to some unwanted side effects that make this alternative unsuitable. In fact, in Figure 7b, it can be noted that, in some cases, the wires formed closed loops in the image. Such loops may be completely filled by the morphological closing. If sufficiently large, they can easily be mistaken for items in the counting phase. In addition, if a pair of items are sufficiently close, the filter may cause the fusion of the corresponding clusters into one (see Figure 7c).

**Figure 7.** (**a**) Item B after morphological closing plus erosion; (**b**) an example of a wire loop after morphological closing followed by erosion; (**c**) clusters merged after dilation. The image is intentionally left in grayscale to show the actual original items.
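The two side effects discussed above (an opening that splits an item at a thin bridge, and a closing that merges two nearby items) can be reproduced on toy binary images. The sketch below is illustrative only: the 3 × 3 square structuring element, the treatment of out-of-image pixels as background, and the shapes `bridged_item` and `two_items` are assumptions, not data from the paper.

```python
def _morph(img, combine):
    """Apply a 3x3 square structuring element; pixels outside the image count as 0."""
    rows, cols = len(img), len(img[0])
    return [[int(combine(0 <= r + dy < rows and 0 <= c + dx < cols and img[r + dy][c + dx]
                         for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
             for c in range(cols)] for r in range(rows)]


def erode(img):
    return _morph(img, all)   # pixel kept only if its whole neighbourhood is set


def dilate(img):
    return _morph(img, any)   # pixel set if any neighbour is set


def opening(img):
    return dilate(erode(img))


def closing(img):
    return erode(dilate(img))


def count_regions(binary):
    """Count 8-connected regions of non-zero pixels."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny in range(max(0, y - 1), min(rows, y + 2)):
                    for nx in range(max(0, x - 1), min(cols, x + 2)):
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count


# One item made of two lobes joined by a one-pixel-thick bridge.
bridged_item = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

# Two distinct items separated by a one-pixel gap.
two_items = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]

print(count_regions(bridged_item), count_regions(opening(bridged_item)))  # 1 2
print(count_regions(two_items), count_regions(closing(two_items)))        # 2 1
```

In both cases the count changes by one, in opposite directions, which is why neither filter alone gives a reliable count with this lighting setup.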

Between the two alternatives proposed above, the first one (i.e., the morphological opening-based solution) proved to perform better. Starting from the morphologically opened image (see Figure 5c), the connected regions representing the actual items needed to be discriminated from the ones representing small wire portions and/or item fragments. To this aim, a natural choice is area-based discrimination, carried out by imposing an appropriate area threshold.

Since the item dimensions are widely variable and unknown a priori, a fixed area threshold value cannot be based on the item dimension itself. In contrast, the wire dimension is constant. Accordingly, it is possible to define a fixed area threshold under which clusters are considered too small to be an item and are therefore ignored. Considering that the wire thickness is approximately 1 mm, corresponding to 7 pixels in the image (based on the shooting setup described in the previous section), the limit area was set at 4 mm², corresponding to approximately 200 pixels. In the example shown in Figure 8a, this method allowed appropriate discarding of small clusters.

However, in several other situations, such as the one depicted in Figure 8b, this criterion led to misclassification. This was mainly due to the heavy cluster fragmentation induced by the acquisition setup and the subsequent image filtering. Besides the simple criterion described above, more complex techniques were tested in order to cluster pixel regions, namely k-means clustering and Support Vector Machines (SVM) [17,18]. The results, not detailed in the present paper, show that the misclassification still occurs.
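The conversion from the physical limit to a pixel threshold, and the way the criterion can fail, can be checked with a short calculation. In this sketch the scale (7 pixels per mm) and the 4 mm² limit come from the text, while the function names and the list of cluster areas are hypothetical.

```python
PIXELS_PER_MM = 7        # from the text: a 1 mm wire spans about 7 pixels
MIN_ITEM_AREA_MM2 = 4    # from the text: smaller clusters are ignored


def area_threshold_px(pixels_per_mm, min_area_mm2):
    """Convert an area limit in mm^2 into pixels for a given image scale."""
    return min_area_mm2 * pixels_per_mm ** 2


def split_clusters(areas_px, threshold_px):
    """Separate cluster areas into counted and discarded ones."""
    kept = [a for a in areas_px if a >= threshold_px]
    discarded = [a for a in areas_px if a < threshold_px]
    return kept, discarded


exact_px = area_threshold_px(PIXELS_PER_MM, MIN_ITEM_AREA_MM2)
threshold_px = 200       # rounded value adopted in the text

# Hypothetical cluster areas: two wire fragments (35, 180), one item that
# filtering has split into two fragments (260, 240), and one intact item (1500).
areas = [35, 180, 260, 240, 1500]
kept, discarded = split_clusters(areas, threshold_px)

print(exact_px, kept, discarded)  # 196 [260, 240, 1500] [35, 180]
```

The exact conversion gives 196 pixels, consistent with the rounded value of 200. The two fragments of the split item both exceed the threshold, so three items are counted instead of two, which is the kind of error shown in Figure 8b.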

**Figure 8.** (**a**) Discrimination between valid and ignored clusters; (**b**) misclassification of small clusters: red-colored clusters are discarded since their area is lower than the selected threshold (i.e., 200 pixels); green-colored clusters are counted, thus leading to a counting error since both belong to a single item.

#### 2.3.2. Lighted Background


Moving from the issues faced with the first setting, the second layout makes use of backlighting. The galvanic frame, containing a set of identical items, is arranged between the camera and an approximately uniformly illuminated white background (see Figure 9).


**Figure 9.** Lighted background setup.

With proper camera settings, the lights saturate the brightness of the background, while the pixels belonging to the items generally appear darker (Figure 10a).

**Figure 10.** (**a**) Original image acquired with backlighting; (**b**) image of Figure 10a after thresholding, showing that some item portions are mistakenly assigned to the background.

Unfortunately, many item regions appeared bright due to specular reflections/inter-reflections among the items themselves. Similarly to the configuration described in the previous section, over-fragmentation issues arose. In fact, although the entire background was better detected and isolated with this setup, some item portions, which appeared light due to the inter-reflections mentioned above, were mistakenly assigned to the background (see Figure 10b). Even using morphological operators similar to the ones described in Section 2.3.1, the fragmentation issue persisted, making it practically unfeasible to correctly classify the pixel clusters.

#### 2.3.3. Rear Projection

To overcome all the discussed drawbacks related to direct backlighting, a third solution was developed and tested. In detail, a 0.5 mm thick white canvas for rear projection (100% polyvinyl chloride, PVC) was placed at a distance of 20 mm from the galvanic frame, which contained seven item typologies, while the light source (in this case, an overhead projector with a 3300 lumen light source) and the camera were arranged as depicted in Figure 11. This architecture allowed acquiring, from the scene, the projected item shadows rather than the items themselves. This enabled discarding of any kind of reflection. The light source came from an LCD overhead projector with 1000 lumens.

**Figure 11.** Rear projection setup.

Experimental tests showed that, due to the item thickness and shape, the minimum distance between the projector and the frame needed to be set at 1.8 m in order to avoid shadow blurring. As depicted in Figure 12a, the items cast a very sharp and uniform dark shadow on the canvas. At the same time, the wires appeared thinner than (for instance) the ones shown in Figure 5b.

**Figure 12.** (**a**) Original image acquired with the rear projection setup; (**b**) binary image after morphological opening process.

Starting from the acquired image (see Figure 12a), a binary image was obtained by thresholding, using the Otsu method [18]. Subsequently, the resulting image was filtered using a 3 × 3 morphological image opening. As shown in Figure 12b, this approach led to minimally fragmented pixel clusters. Therefore, by using an area threshold equal to 200 pixels, it was possible to correctly count the item number.
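The thresholding, opening, and area-filtering steps described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' Matlab implementation: the function names (`otsu_threshold`, `opening_3x3`, `count_items`), the choice of 8-connectivity, the treatment of pixels outside the image as background, and the convention that dark pixels (shadows) are foreground are all assumptions made for the example.

```python
from collections import deque


def otsu_threshold(gray):
    """Return the gray level (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def _morph(mask, erode):
    """3 x 3 erosion (erode=True) or dilation (erode=False) of a boolean mask."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                mask[yy][xx] if 0 <= yy < h and 0 <= xx < w else False
                for yy in (y - 1, y, y + 1)
                for xx in (x - 1, x, x + 1)
            ]
            out[y][x] = all(vals) if erode else any(vals)
    return out


def opening_3x3(mask):
    """Morphological opening: erosion followed by dilation."""
    return _morph(_morph(mask, True), False)


def count_items(gray, min_area=200):
    """Count dark clusters whose area is at least min_area pixels."""
    t = otsu_threshold(gray)
    mask = opening_3x3([[v <= t for v in row] for row in gray])
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first search over one 8-connected cluster.
            area = 0
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                area += 1
                for ny in (cy - 1, cy, cy + 1):
                    for nx in (cx - 1, cx, cx + 1):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if area >= min_area:
                count += 1
    return count
```

On a synthetic image with two 16 × 16 dark squares, a one-pixel-wide dark line, and a 3 × 3 dark speck on a bright background, the opening removes the line and the 200-pixel area threshold discards the speck, so `count_items` returns 2.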

In summary, the rear projection setup proved to be the most suitable among the ones tested in order to correctly isolate items to be counted. In fact, the key point of the procedure resided in the very sharp native image, in which the shadows were extremely defined. Consequently, the required filtering operations were far less aggressive than needed in previous cases.

For this reason, this method was selected to design the counting machine, as described in the next section.

#### **3. Rear Projection-Based Counting Machine Prototype**

Although the preliminary tests performed using the seven different item typologies shown in Figure 12a were deemed representative, a prototypal rear projection-based counting machine was designed in order to perform extensive testing in an industrial environment. As shown in Figure 13, the system comprises:

- An industrial camera with a […] mm lens and 12-megapixel resolution (4104 × 3006 pixels);
- An LCD overhead light projector, to ensure uniform lighting;
- A pair of orientable mirrors, used to extend the light path up to the 1.8 m mentioned above. The mirrors reduce the overall dimensions of the counting machine, which must not exceed 1.5 × 1.0 × 1.0 m so as not to be excessively cumbersome for an industrial environment;
- An enclosure system, to ensure that the environmental light does not affect the scene.

**Figure 13.** Counting system main components.

In Figure 14a, a rendering of the designed counting machine architecture shows the arrangement of the above-mentioned components. The final design of the machine is shown in Figure 14b. The projector is placed backward, on the frontal part of the machine. Light is reflected by the first mirror upwards towards the second one, which reflects it forward to hit the galvanic frame. The shadow of the frame is projected on the rear-projection canvas, which is arranged parallel to the frame. In the prototypal implementation of the system, this setup did not show any unevenness in the screen illumination when a perfectly white image was projected. Should unevenness occur, it can be compensated for by projecting an image specifically designed to feature slightly darker or lighter regions in correspondence with the more or less illuminated areas of the screen, respectively. This is a major advantage of using the overhead LCD projector rather than a conventional light source (i.e., a lamp).
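The compensation idea can be illustrated with a short sketch. It assumes that a brightness map of the canvas has been measured while projecting a full-white image and that the projector output is linear in the pixel value (no gamma correction). Neither the function name `compensation_image` nor this specific procedure comes from the paper; it is only one plausible way to build such an image.

```python
def compensation_image(measured, white=255):
    """Build a projector image that evens out the screen illumination.

    measured: 2-D list of positive brightness readings, one per screen region,
              taken while a full-white image is projected (hypothetical input).
    Returns a 2-D list of gray levels: the darkest region keeps full white,
    brighter regions are dimmed in proportion to their excess brightness.
    """
    darkest = min(min(row) for row in measured)
    if darkest <= 0:
        raise ValueError("brightness readings must be positive")
    return [[round(white * darkest / v) for v in row] for row in measured]


# Example: the top-left region is 25% brighter than the rest of the screen.
measured = [[250, 200],
            [200, 200]]
comp = compensation_image(measured)
# The brighter region gets a darker projected value than the others.
```

Under the linearity assumption, the product of the measured brightness and the projected gray level is the same for every region, which is what uniform illumination requires.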


A Surface Go tablet (Microsoft Corporation, Washington, U.S.), running the designed application (developed in Matlab®, MathWorks, Inc., Natick, Massachusetts, U.S., 2019), was then used to command the industrial camera. By means of a dedicated Graphical User Interface (GUI), the operator could check the position of the frame and start the acquisition when such a position was considered correct (see Figure 15).


**Figure 14.** (**a**) Counting machine architecture; (**b**) final design of the counting machine.

**Figure 15.** Dedicated GUI implemented for controlling the counting machine performance. On the left, it is possible to read the number of items. Moreover, it is possible to save the screened image (green camera icon) and to close the application (red X in figure). The user can also access a setting panel (blue gear icon) in case they want to set a different threshold value for the algorithm.

The procedure was then able, in approximately 0.3 s, to provide the number of the detected items. Simultaneously, it showed a control picture, on which clusters that had been considered were colored red, while those that were ignored were white. In this way, the operator could rapidly check the effectiveness of the procedure and make corrections if needed.

#### **4. Discussion and Conclusions**

In this paper, a method and a machine for counting the number of small metal parts randomly arranged on a galvanic frame were proposed. A priori knowledge of the area of each item to be treated in the galvanic bath makes it possible to estimate, with satisfactory accuracy, the overall area to be treated and, consequently, to optimize the settings of the treatment itself. Especially in the high fashion field, in which precious materials are often used to realize plates, this enables minimization of material waste, thus leading to a significant cost saving. Considering all the limitations that the application imposes (e.g., pieces already mounted on the frame and high reflectivity), many of the approaches usually adopted for counting machines (e.g., free fall and weight analysis) cannot be followed. The procedure is hence based on machine vision and makes use of rear-projection on a canvas to obtain an image that is sufficiently sharp and easy to process with simple morphological operators. A counting machine implementing the devised system was designed.

This prototypal counting machine was pre-tested with 20 different galvanic frames, hosting 20 different kinds of objects, with maximum dimensions spanning from 10 to 80 mm. In Table 1, the results obtained for 5 of the 20 tests are listed.


**Table 1.** The results obtained with the three developed systems for counting the items attached to a galvanic frame.

**Test Actual # Minimum Item Frontal Lighted**

**Rear Projection (# of items)**

\* Without image closure

Referring to the entire set of 20 tests, the frontal light architecture correctly counted the number of items in nine cases (45%), while the lighted background-based architecture was successful in eight cases (40%). For both systems, over-fragmentation led to an excessive number of items counted. This was particularly true when the number of items increased and the minimum dimensions decreased down to 50 mm. Therefore, their use is not recommended for this kind of application, since an overestimation of the number of attached items may cause a coating thickness lower than the desired one.

By using the frontal light-based architecture with the addition of the image morphological closure algorithm, the percentage of correctly counted frames increased to 60%. However, in two of the 12 correctly counted frames, the items erroneously counted twice due to cluster fragmentation were compensated for by adjacent items erroneously merged together by the image closure. By contrast, with the rear projection-based architecture, the number of counted objects was exactly equal to the number of actual objects mounted on the frames for all the test cases (100%). As already mentioned, since the surface of each item is a technical specification for the galvanic companies, multiplying it by the number of items gives the overall surface to be coated.
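The arithmetic behind the last sentence, and the effect of a miscount on the estimated surface, can be made explicit with a small sketch. The function names and the numerical values (50 items of 120 mm² each, miscounted as 52) are hypothetical and chosen only for illustration.

```python
def total_surface_mm2(item_count, item_area_mm2):
    """Overall surface to be coated: number of items times the area of one item."""
    if item_count < 0 or item_area_mm2 < 0:
        raise ValueError("count and area must be non-negative")
    return item_count * item_area_mm2


def relative_surface_error(counted, actual):
    """Relative error of the estimated surface caused by a miscount."""
    if actual <= 0:
        raise ValueError("actual count must be positive")
    return (counted - actual) / actual


# Hypothetical frame: 50 items of 120 mm^2 each, miscounted as 52.
true_surface = total_surface_mm2(50, 120.0)        # 6000.0 mm^2
estimated_surface = total_surface_mm2(52, 120.0)   # 6240.0 mm^2
error = relative_surface_error(52, 50)             # about 0.04, i.e. 4%
```

Because the item area is a fixed specification, the relative error of the surface estimate equals the relative error of the count, which is why an exact count matters for setting the treatment.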

Despite these encouraging results, the system will undergo an extensive test campaign in an Italian company working in the galvanic coating industry to increase the number of test cases up to 1000 different frames.

Accordingly, future work will extensively test the devised procedure, both in terms of the performance (i.e., the counted number of items vs. the actual number, verified by visually inspecting the frames) and of the usability.

**Author Contributions:** Conceptualization, L.P., L.G. and R.F.; methodology, L.G. and R.F.; software, M.S. and L.P.; validation, Y.V., L.P. and M.S.; data curation, M.S. and Y.V.; writing – original draft preparation, L.P.; writing – review and editing, L.G., R.F. and Y.V.; supervision, L.G. and R.F.; project administration, L.G.

**Funding:** This work has been carried out thanks to the decisive regional contribution from the Regional Implementation Programme co-financed by the FAS (now FSC) and the contribution from the FAR funds made available by the MIUR.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**

1. Bard, A.J.; Faulkner, L.R. *Electrochemical Methods: Fundamentals and Applications*; Wiley: Hoboken, NJ, USA, 1980; p. 718.
2. Barker, D.; Walsh, F.C. Applications of Faraday's Laws of Electrolysis in Metal Finishing. *Trans. IMF* **1991**, *69*, 158–162.
3. Phromlikhit, C.; Cheevasuvit, F.; Yimman, S. Tablet counting machine base on image processing. In Proceedings of the 5th 2012 Biomedical Engineering International Conference, Ubon Ratchathani, Thailand, 5–7 December 2012.


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **An Improved MB-LBP Defect Recognition Approach for the Surface of Steel Plates**

## **Yang Liu, Ke Xu \* and Jinwu Xu**

Collaborative Innovation Center of Steel Technology, University of Science and Technology, Beijing 100083, China; liuyang\_ustb\_1988@163.com (Y.L.); jwxu@ustb.edu.cn (J.X.)

**\*** Correspondence: xuke@ustb.edu.cn; Tel.: +86-10-6233-2159

Received: 27 August 2019; Accepted: 30 September 2019; Published: 10 October 2019

**Abstract:** The detection of surface defects is very important for improving the quality of steel plates. In actual production, as steel plate production lines run faster, the surface defect detection algorithm must meet real-time requirements (less than 100 ms/image) while achieving high detection accuracy (at least 90%). In this paper, an improved multi-block local binary pattern (MB-LBP) algorithm is proposed. This algorithm not only retains the simplicity and efficiency of the local binary pattern (LBP) algorithm, but also finds a suitable scale to describe the defect features by changing the block sizes, thus ensuring high recognition accuracy. The experiments show that the method satisfies the requirements of online real-time detection in terms of speed (63 ms/image) and surpasses the widely used scale-invariant feature transform (SIFT), speeded-up robust features (SURF), gray-level co-occurrence matrix (GLCM), and LBP algorithms in recognition accuracy (94.30%), which indicates that MB-LBP has practical application value in an online real-time detection system.

**Keywords:** MB-LBP; surface defect detection; feature extraction; defect recognition

### **1. Introduction**

Steel plates are widely used in engineering fields such as ships, bridges, machinery, construction, and automobile manufacturing. In the production process of a plate, due to the rolling process, various types of defects are easily formed on the surface of steel plates such as cracks, scratches, indentations, pits and scales [1]. These defects have a great impact on the appearance and performance of the product, so it is extremely important to detect the surface defects on the plate.

The surface defect recognition algorithm for steel plates is the core part of the entire surface defect detection system. Yun et al. [2] proposed a new defect detection algorithm, which is based on Gabor filters. The Gabor filters are optimized using a new optimization algorithm known as the univariate dynamic encoding algorithm for searches. The algorithm finds the minimum value of the cost function related to the energy separation criteria between the defect and the defect-free regions. Xu et al. [3] proposed a classifier based on Ads Boosting algorithms for the classification of defects with textural features that employed non-sampling wavelet decomposition to the scale co-occurrence matrix of the low-pass component and the grayscale co-occurrence matrix of the high-pass component. Pan et al. [4] proposed an engineering-driven rule-based detection (ERD) method. The ERD consists of three detection stages using the pixel features of bleeds, which are transferred from the physical features generated via engineering knowledge. Yu et al. [5] proposed a surface trait extraction method based on complex Contourlet decomposition, which has the characteristics of shift invariance, excellent directional selectivity, and a higher retrieval rate. Industrial test results show that the accuracy rate of classifying surface features is about 90%, and it can be used in image feature extraction and slab defect detection. Miyamoto et al. [6] proposed a one-shot measurement using the plane wave with the time-of-flight (TOF) based transmission method and validated the method using wave propagation simulations. The defect in billets can be detected regardless of the defect position, even in the vicinity of the surfaces of a billet. Yan et al. [7] proposed a mathematical morphology detection method based on multi-scale elements, which can not only filter the noise effectively, but can also delete the false characteristics of the cracks, scales, and slag. Lin et al. [8] proposed a robust detection method based on the vision attention mechanism and deep learning of feature maps in order to relieve the problem of the false and missed detection of casting defects in X-ray detection. The experimental results showed that the false rate and missed rate for the detection of casting defects were less than 4%, and the accuracy of the defect detection was more than 96%. Di et al. [9] proposed a new semi-supervised learning method based on a convolutional autoencoder (CAE) and semi-supervised generative adversarial networks (SGAN) to classify the surface defects of steels. Jiang et al. [10] suggested a method for detecting the appearance defects of castings based on a deep residual network. This method divides the casting into multiple regions, preprocesses the image of each region, and then inputs the processed image into the convolutional neural network to extract the features, before finally determining whether the sample has defects.

*Appl. Sci.* **2019**, *9*, x FOR PEER REVIEW 2 of 14


All of the above methods effectively solve the problem of recognition accuracy. However, in actual production, the steel production line runs very fast, so the surface defect detection algorithm needs to meet the real-time requirements while ensuring the accuracy.

In this paper, we propose the multi-block local binary pattern (MB-LBP) algorithm to extract the features of the surface defects of steel plates. The experimental results show that the MB-LBP algorithm meets the requirements of the online detection of steel plate defects in terms of both speed and accuracy.

The rest of this paper is organized as follows. Section 2 introduces the surface defects of steel plates. Section 3 introduces the principle of MB-LBP. Section 4 introduces the experiments and analysis of surface defect detection based on MB-LBP. Finally, the conclusions are discussed in Section 5.

#### **2. Surface Defects of Steel Plates**

Based on our observations and research, the surface defects of slabs can be divided into five types: cracks, scratches, indentations, pits, and scales. The following are several common defect samples collected from the surface defect online detection system.

#### *2.1. Cracks*

Cracks are the most serious defect on the surface of steel plates, as shown in Figure 1. Cracks may cause tremendous damage in the following rolling procedure [11].

**Figure 1.** Cracks.

#### *2.2. Scratches*

Scratches are generally due to friction between the mechanical equipment or the relative motion of the plates on the roller table [12]. Scratches mostly appear as bright stripes in the image, as shown in Figure 2. Since scratches are mostly caused by mechanical equipment, they usually appear periodically. That is, scratches appear continually in adjacent images, and their positions and features are similar, so scratches are easier to classify.

**Figure 2.** Scratches.

#### *2.3. Indentations*

The occurrence of indentations is due to inclusions in the plates during continuous casting, and pits appear on the surface of the billet, or as depressions on the surface of the roller table. The typical images of indentation are shown in Figure 3. Casting temperature, improper control of casting speed, entrapment of protective slag, etc., can cause defects on the surface and inside of the strand, especially surface flaws, pits, and buckling defects. During the heating and rolling process, it is possible to further form an indentation. The indentation seriously affects the quality of the casting blank as the indentation size varies, the direction is uncertain, and the manual inspection is difficult.

**Figure 3.** Indentations.

#### *2.4. Pits*

Pits are pit-shaped defects with a certain depth due to the periodical vibration of the mold during the production of the plates. Typical images of a pit are shown in Figure 4. The pits easily cause indentations during the subsequent hot rolling process, which is one of the important factors affecting the surface quality of the plates. In addition, cracks may occur at the bottom of deep pits. Pits are common on the surfaces of the plates. Shallow pits have little effect on the surface quality of the plates, but deeper pits require more attention.

**Figure 4.** Pits.

#### *2.5. Scales*

The surface temperature of the continuous plates during production is usually around 1000 °C, so the surface is easily oxidized to form oxide scales. The typical images of scales are shown in Figure 5. Scales do not seriously affect the quality of the plate, but they have a great negative effect on the recognition of the surface defects of the plates.

**Figure 5.** Scales.

#### **3. Principle of Multi-Block Local Binary Pattern 3. Principle of Multi-Block Local Binary Pattern**

#### *3.1. Principle of Local Binary Pattern 3.1. Principle of Local Binary Pattern*

Due to the high ambient temperature and complex image background, the detection of surface defects for steel plates becomes a puzzling problem. The change of illumination will cause a linear change in the gray level of the image. The local binary pattern (LBP) is a method to describe the texture of the image that can eliminate linear illumination by comparing the gray values of pixels. LBP was introduced by Ojala [13,14] and has been widely applied in many fields [15] such as metal surface quality detection [16], paper quality detection [17], image texture analysis [18], and target detection [19]. Due to the high ambient temperature and complex image background, the detection of surface defects for steel plates becomes a puzzling problem. The change of illumination will cause a linear change in the gray level of the image. The local binary pattern (LBP) is a method to describe the texture of the image that can eliminate linear illumination by comparing the gray values of pixels. LBP was introduced by Ojala [13,14] and has been widely applied in many fields [15] such as metal surface quality detection [16], paper quality detection [17], image texture analysis [18], and target detection [19].

The LBP descriptor was initially proposed as an effective grayscale-invariant texture descriptor based on image gray levels. The basic LBP descriptor is defined in a 3 × 3 pixel area. The gray value of the center pixel is *g<sub>c</sub>*. The gray values of its 8 neighborhood pixels are *g*<sub>0</sub>, ..., *g*<sub>7</sub>, respectively. The texture description T of the center pixel can be expressed as Equation (1):

$$T \sim (g\_0 - g\_c, \dots, g\_7 - g\_c) \tag{1}$$

Then, compare the gray values of the eight neighborhood pixels with *g<sub>c</sub>*. If the value is greater than *g<sub>c</sub>*, the binarization value of the pixel is 1; otherwise, the binarization value of the pixel is 0. The texture description T after binarization can then be expressed as Equation (2):

$$T \sim (s(g\_0 - g\_c), \dots, s(g\_7 - g\_c)) \tag{2}$$

s(x) is calculated by Equation (3):

$$s(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases} \tag{3}$$

After binarization, choose one of the eight neighborhood pixels as a starting point. The choice is arbitrary, but it must be the same for the entire image; for convenience, this paper always uses the top-left pixel as the starting point. Then, encode all binarization values clockwise into a binary number. Therefore, the LBP value of pixel (*x<sub>c</sub>*, *y<sub>c</sub>*) is calculated by Equation (4):

$$\text{LBP}(x\_c, y\_c) = \sum\_{i=0}^{7} s(g\_i - g\_c)\, 2^i \tag{4}$$

Figure 6 shows the complete process of the local binary pattern (LBP).

The LBP value of each pixel can be calculated by the above process. In this way, an LBP image can be obtained [20].

**Figure 6.** Complete process of the local binary pattern (LBP).
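As an illustration only, Equations (1)–(4) can be sketched in plain Python as follows. The function names `lbp_value` and `lbp_image` are hypothetical and do not come from the paper. The sketch assumes the top-left starting point, a clockwise order, and the bit weights of Equation (4), where neighbor *i* contributes 2<sup>i</sup>.

```python
def lbp_value(patch):
    """LBP code of the centre pixel of a 3x3 patch (list of 3 rows)."""
    center = patch[1][1]
    # Clockwise neighbour order g_0..g_7, starting at the top-left pixel.
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for i, (r, c) in enumerate(order):
        s = 1 if patch[r][c] - center > 0 else 0  # Equation (3)
        code += s << i                            # Equation (4): s * 2^i
    return code


def lbp_image(img):
    """LBP codes of all interior pixels of a 2-D list of gray values."""
    h, w = len(img), len(img[0])
    return [[lbp_value([row[x - 1:x + 2] for row in img[y - 1:y + 2]])
             for x in range(1, w - 1)]
            for y in range(1, h - 1)]


patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
# A linear illumination change (gain 2, offset 10) leaves the code unchanged.
brighter = [[2 * v + 10 for v in row] for row in patch]
print(lbp_value(patch), lbp_value(brighter))
```

Because only the sign of each difference enters the code, any increasing linear change of the gray levels gives the same pattern value, which is the illumination invariance described above.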


#### *3.2. Principle of MB-LBP*

A series of improved LBP algorithms have been proposed, such as the LBP uniform pattern and the LBP rotation-invariant pattern. All of these LBP algorithms operate on single pixels. However, different defects have different sizes, and the appropriate scale for describing their texture features differs accordingly (larger defects should be described by more points), so different block sizes (with different numbers of points) should be selected. Therefore, this paper proposes the multi-block LBP (MB-LBP) to describe the features of defects with different sizes [21].

The specific steps of the MB-LBP are as follows:

#### (1) Division:

The source image is divided into small blocks, and each block contains n × n pixels (take 2 × 2 as an example).

#### (2) Binarization:

Calculate the mean gray value of every block, then compare the mean gray value of each block with the mean gray values of its neighborhood blocks. If the mean gray value of a neighborhood block differs from that of the center block by more than 50% of the center block's value, set it to 1; otherwise, set it to 0, as shown in Equation (5):

$$s(g\_i, g\_c) = \begin{cases} 1, \frac{|g\_i - g\_c|}{|g\_c|} > 50\% \\ 0, \frac{|g\_i - g\_c|}{|g\_c|} \le 50\% \end{cases} \tag{5}$$

where *g<sub>i</sub>* is the mean gray value of each neighborhood block and *g<sub>c</sub>* is the mean gray value of the center block.

Another improvement of the MB-LBP over LBP is that, when calculating *s*(*g<sub>i</sub>*, *g<sub>c</sub>*), it no longer simply compares the mean value of the neighborhood block with that of the center block, but compares their difference as a percentage of the center block's mean value. This enhances the robustness of the MB-LBP.

The binarization process is shown in Figure 7.

#### (3) Computation of MB-LBP pattern value:

First, choose one of the eight neighborhood blocks as a starting point. The choice is arbitrary, but it must be the same for the entire image; for convenience, this paper always uses the top-left block as the starting point.

Then, encode all binarization values clockwise into a binary number, as calculated by Equation (6):

$$\text{MB-LBP}\_P^R(x\_c, y\_c) = \sum\_{p=0}^{P-1} s(g\_p, g\_c)\, 2^p \tag{6}$$

where *g<sub>p</sub>* is the mean gray value of the *p*-th neighborhood block and *P* = 8 is the number of neighborhood blocks.

#### (4) Get MB-LBP pattern value of entire image:

*Appl. Sci.* **2019**, *9*, 4222

Calculate the MB-LBP pattern value of each block by traversing the entire image from left to right and top to bottom. Edge blocks lack some neighborhood blocks and have little effect on the texture features of defects located inside the image, so the MB-LBP pattern values of edge blocks are ignored.

In this way, an entire MB-LBP image can be obtained, as shown in Figure 8.

**Figure 7.** Binarization process of the MB-LBP.

**Figure 8.** Obtain the MB-LBP pattern value of the entire image.
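The four MB-LBP steps of Section 3.2 can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `block_means`, `s_rel`, and `mb_lbp_image` are hypothetical, the blocks are non-overlapping, and Equation (5) is evaluated in the multiplied-out form |g_i − g_c| > 0.5·|g_c| so that a center mean of zero does not cause a division by zero.

```python
# Clockwise block offsets (row, column), starting at the top-left neighbour.
ORDER = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def block_means(img, n):
    """Step (1): mean gray value of each non-overlapping n x n block."""
    rows, cols = len(img) // n, len(img[0]) // n
    return [[sum(img[by * n + dy][bx * n + dx]
                 for dy in range(n) for dx in range(n)) / (n * n)
             for bx in range(cols)]
            for by in range(rows)]


def s_rel(g_i, g_c, ratio=0.5):
    """Step (2), Equation (5), multiplied out to avoid dividing by g_c."""
    return 1 if abs(g_i - g_c) > ratio * abs(g_c) else 0


def mb_lbp_image(img, n=2, ratio=0.5):
    """Steps (3) and (4): pattern value of every interior block."""
    means = block_means(img, n)
    rows, cols = len(means), len(means[0])
    out = []
    for by in range(1, rows - 1):          # edge blocks are ignored
        row = []
        for bx in range(1, cols - 1):
            g_c = means[by][bx]
            code = 0
            for p, (dy, dx) in enumerate(ORDER):
                code += s_rel(means[by + dy][bx + dx], g_c, ratio) << p
            row.append(code)
        out.append(row)
    return out


# 6 x 6 test image made of constant 2 x 2 blocks with these block means.
demo_means = [[100, 100, 100],
              [100, 100, 200],
              [30, 100, 151]]
demo_img = [[demo_means[y // 2][x // 2] for x in range(6)] for y in range(6)]
print(mb_lbp_image(demo_img, n=2))
```

In the demo, only the right, bottom-right, and bottom-left neighbors differ from the center mean by more than 50%, so bits 3, 4, and 6 are set.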

#### **4. Experiment and Analysis of Surface Defects Detection Based on MB-LBP**

#### *4.1. Experimental Design*

#### 4.1.1. Architecture of Surface Defect Detection System

A complete online surface defect detection system for steel plates should include image capturing, image preprocessing, defect feature extraction, and defect classification [22], as shown in Figure 9. In preprocessing, the image segmentation algorithm is used to mark suspected defect areas. This not only prepares for identifying the sizes and locations of defects, but also greatly reduces the amount of calculation for defect feature extraction and defect classification and identification. After the defect areas are determined, the defect features are extracted by the feature extraction algorithm. Then, the obtained feature vectors are input into the classifier for classification and identification. Finally, information such as the class, size, location, and severity of the defect can be obtained.

**Figure 9.** The flow chart of the surface defect detection system for steel plates.


#### 4.1.2. Algorithms of Image Segmentation in Image Preprocessing

The steps of image segmentation in image preprocessing are as follows:

#### (1) Integral image segmentation

The concept of integral images was proposed by Viola and Jones [23]. With the integral image, the grayscale sum of the pixels in a rectangular area can be quickly calculated, which represents the features of this area. The value of any point in the integral image is the grayscale sum of all pixels from the upper left corner to this point, that is:

$$p(i, j) = \sum\_{i' \le i,\ j' \le j} pix(i', j') \tag{7}$$

where *pix*(*i*, *j*) is the grayscale of the pixel located at (*i*, *j*).

As shown in Figure 10, first, the original image is equally divided into blocks. The side length of a block can be set to a power of two (eight pixels in this paper). With the integral image, the mean grayscale of each block can be quickly calculated, which represents the features of this block.

**Figure 10.** Divide original image into blocks. (**a**) The original image; (**b**) the divided blocks.
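A minimal Python sketch of Equation (7) and of the block means derived from it is given below. The helper names `integral_image`, `rect_sum`, and `block_mean_grid` are hypothetical; indices are zero-based with rows first, which is an assumption about the layout rather than something stated in the paper.

```python
def integral_image(img):
    """p(i, j) = sum of pix(i', j') over i' <= i and j' <= j (Equation (7))."""
    h, w = len(img), len(img[0])
    p = [[0] * w for _ in range(h)]
    for i in range(h):
        row_sum = 0
        for j in range(w):
            row_sum += img[i][j]
            p[i][j] = row_sum + (p[i - 1][j] if i > 0 else 0)
    return p


def rect_sum(p, top, left, bottom, right):
    """Gray-level sum over rows top..bottom and columns left..right (inclusive)."""
    total = p[bottom][right]
    if top > 0:
        total -= p[top - 1][right]
    if left > 0:
        total -= p[bottom][left - 1]
    if top > 0 and left > 0:
        total += p[top - 1][left - 1]
    return total


def block_mean_grid(img, size=8):
    """Mean gray value of every size x size block, read from the integral image."""
    p = integral_image(img)
    rows, cols = len(img) // size, len(img[0]) // size
    return [[rect_sum(p, r * size, c * size,
                      (r + 1) * size - 1, (c + 1) * size - 1) / (size * size)
             for c in range(cols)]
            for r in range(rows)]


small = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(integral_image(small))
print(rect_sum(integral_image(small), 1, 1, 2, 2))
```

Each rectangle sum needs at most four look-ups in the integral image, regardless of the block size, which is why the block means can be obtained quickly.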

#### (2) Differential calculation

Compare each block with its four neighbor blocks and obtain all difference values. Then, select the difference value with the largest absolute value as the feature value of the current block. All feature values form a block difference matrix. The result is shown in Figure 11. The difference values in the block difference matrix reflect the change of the grayscale in the image. Blocks with a small difference value and gentle grayscale changes represent a defect-free area, while blocks with a large difference value and sharp grayscale changes represent a suspected defect area.

**Figure 11.** Results of the four neighbor block difference.
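The four-neighbor block difference can be sketched as follows. The function name `block_difference` is hypothetical. Two details are assumptions, since the text does not fix them: neighbors that fall outside the image are skipped, and the signed difference (center minus neighbor) with the largest magnitude is stored.

```python
def block_difference(means):
    """Largest-magnitude difference between each block and its 4 neighbours."""
    rows, cols = len(means), len(means[0])
    diff = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best = 0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:   # skip missing neighbours
                    d = means[r][c] - means[rr][cc]
                    if abs(d) > abs(best):
                        best = d
            diff[r][c] = best
    return diff


block_grid = [[10, 10, 10],
              [10, 50, 10],
              [10, 10, 12]]
for row in block_difference(block_grid):
    print(row)
```

The bright center block and its four direct neighbors receive large magnitudes, while the corner blocks, which only touch similar blocks, stay near zero.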

#### (3) Threshold setting

Now, the question is how to select a reasonable threshold to determine whether the difference value is large or small. Set two thresholds, high and low, to divide the grayscale difference. As shown in Figure 12, the upper plane represents the high threshold and the lower plane represents the low threshold. Blocks with a difference value greater than the high threshold are identified as "master", blocks with a difference value less than the low threshold are identified as "abort", and blocks with a difference value between the low and high thresholds are identified as "candidate". The "master" blocks are considered to be defect areas; a "candidate" block is a candidate defect area and is promoted to a "master" block only when one of its eight neighbor blocks is a "master" block.

**Figure 12.** Division of difference with high and low threshold.
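The double-threshold labeling can be sketched as below. The function name `label_blocks` and the threshold values in the demo are hypothetical. The sketch makes three assumptions that the text leaves open: the magnitude of the difference is compared with the thresholds, values exactly equal to a threshold count as "candidate", and promotion is done in a single pass against the blocks that were "master" before promotion.

```python
def label_blocks(diff, low, high):
    """Label each block 'master', 'candidate' or 'abort', then promote candidates."""
    rows, cols = len(diff), len(diff[0])
    labels = [["abort"] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            d = abs(diff[r][c])
            if d > high:
                labels[r][c] = "master"
            elif d >= low:
                labels[r][c] = "candidate"

    promoted = [row[:] for row in labels]
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != "candidate":
                continue
            # A candidate becomes master if any of its eight neighbours is master.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if ((dr or dc) and 0 <= rr < rows and 0 <= cc < cols
                            and labels[rr][cc] == "master"):
                        promoted[r][c] = "master"
    return promoted


diff_demo = [[0, 5, 0],
             [0, 30, 12],
             [0, 0, 0],
             [11, 0, 0]]
for row in label_blocks(diff_demo, low=10, high=20):
    print(row)
```

In the demo, the block with difference 12 is promoted because it touches the block with difference 30, whereas the isolated block with difference 11 remains a candidate.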

#### (4) Regional growth

Usually, a complete defect area is composed of multiple blocks. Therefore, it is necessary to merge adjacent "master" blocks into larger areas. That is, when there is a "master" block among the eight neighbor blocks of a "master" block, merge them into a new "master" block. Repeat this operation until the new "master" block has no "master" block in its neighborhood.
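The repeated merging described above is equivalent to finding the eight-connected components of the "master" blocks. The sketch below uses a breadth-first search for this purpose; the function name `grow_regions` is hypothetical, and the search strategy is a stand-in for the iterative merging, which the paper does not specify in detail.

```python
from collections import deque


def grow_regions(labels):
    """Group 8-connected 'master' blocks; returns sorted (row, col) lists."""
    rows, cols = len(labels), len(labels[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != "master" or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            blocks = []
            while queue:
                y, x = queue.popleft()
                blocks.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if (0 <= yy < rows and 0 <= xx < cols
                                and not seen[yy][xx]
                                and labels[yy][xx] == "master"):
                            seen[yy][xx] = True
                            queue.append((yy, xx))
            regions.append(sorted(blocks))
    return regions


M, A = "master", "abort"
grid = [[M, A, A, A],
        [A, M, A, A],
        [A, A, A, M]]
print(grow_regions(grid))
```

The two diagonally adjacent blocks form one region, and the isolated block in the last row forms a second one. The bounding box of each region can then serve as a suspected defect area.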

#### 4.1.3. Experiment of Defect Feature Extraction and Defect Classification in this Paper

In this paper, MB-LBP was employed for feature extraction and a support vector machine (SVM) was employed for defect classification and identification.

The complete experiment process is shown in Figure 13.

**Figure 13.** Feature extraction based on support vector machine (SVM).

The process of the MB-LBP + SVM experiment is detailed as follows:

#### (1) Image division

The image is divided into small blocks of n × n pixels (n = 2, 4, 8, and so on), and the mean gray value of each block is taken as the gray value of that block.

#### (2) MB-LBP pattern value calculation

For each block, compare its gray value with that of each of its eight neighbors clockwise, following Equation (5). When the gray value of the neighbor block differs from that of the center block by more than 50%, write "1"; otherwise, write "0". This gives an 8-digit binary number, which is usually converted to a decimal number for convenience.

#### (3) Feature vector generation

Now, we have the binary numbers of all blocks in the entire image. Then, compute the histogram of the frequency with which each binary number occurs in the entire image.

An 8-digit binary number can take at most 2<sup>8</sup> = 256 different values. Therefore, this histogram can be seen as a 256-dimensional feature vector.

Such a large number of patterns results in a long calculation time. Furthermore, it leads to a sparse feature vector that is difficult to compute with. To solve these two problems, referring to the LBP uniform pattern [24], we proposed the MB-LBP uniform pattern. This idea is motivated by the fact that some binary patterns occur more commonly in texture images than others [25]. A binary pattern is called a uniform pattern if it contains at most two 0–1 or 1–0 transitions. For example, 00010000 (two transitions) is a uniform pattern, whereas 01010100 (six transitions) is not. The remaining, less common patterns, which contain more than two transitions, are called non-uniform patterns. The maximum number of MB-LBP uniform patterns is 59, much smaller than the 256 of the original MB-LBP.

Figure 14 shows the comparison of the histogram between the original MB-LBP and the MB-LBP uniform pattern.

**Figure 14.** The comparison of the histogram between the original LBP and LBP uniform pattern.
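The reduction from 256 to 59 histogram bins can be checked with a short Python sketch. The names `transitions`, `UNIFORM`, and `uniform_histogram` are hypothetical. Following the usual LBP uniform pattern definition, the sketch counts transitions on the circular bit string, which is an assumption here; with 8 bits this gives 58 uniform codes, and a single extra bin is assumed to collect all non-uniform codes, for 59 bins in total.

```python
def transitions(code, bits=8):
    """Number of 0-1 and 1-0 transitions in the circular bit string of `code`."""
    # Rotate right by one bit, then XOR: a set bit marks a change between neighbours.
    rotated = (code >> 1) | ((code & 1) << (bits - 1))
    return bin(code ^ rotated).count("1")


# All 8-bit codes with at most two transitions, in increasing order.
UNIFORM = [c for c in range(256) if transitions(c) <= 2]
BIN_OF = {c: i for i, c in enumerate(UNIFORM)}


def uniform_histogram(codes):
    """Histogram with one bin per uniform code plus one shared non-uniform bin."""
    hist = [0] * (len(UNIFORM) + 1)
    for c in codes:
        hist[BIN_OF.get(c, len(UNIFORM))] += 1
    return hist


print(transitions(0b00010000), transitions(0b01010100))
print(len(UNIFORM), len(uniform_histogram([16, 84, 16, 0])))
```

The two example patterns from the text give two and six transitions, respectively, so only the first one receives its own bin.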

#### (4) The generated feature vectors are sent to the SVM for classification and identification

The original SVM is a binary classifier. In this paper, the one-vs-one multi-classification SVM method [26] was adopted. An SVM is constructed for every pair of classes, giving a total of n(n − 1)/2 SVMs, where n is the number of classes. In this paper, n = 5, so there were 5(5 − 1)/2 = 10 SVMs in total. The specific construction process is as follows:

(c) Train the classifier between class *i* and *j*;

(d) Repeat steps (a)~(c) until the training between any two classes is completed. Obtain the optimal classification surface of all classifiers.

When a test image I is classified, vote to determine which class the image I belongs to according to the minimum sum of the distances from image I to the optimal classification surface.
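The pair enumeration behind the one-vs-one scheme can be sketched in Python as follows. This is only an illustration under loud assumptions: no SVM is trained here, the hypothetical `nearest_center_rule` merely stands in for a trained binary classifier, the one-dimensional features and class centers are invented for the demo, and the final decision uses plain majority voting rather than the distance-based rule described in the text.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise):
    """Let every pairwise classifier vote; return the class with most votes."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    return votes.most_common(1)[0][0]


def nearest_center_rule(a, b, centers):
    """Stand-in for a trained binary SVM between classes a and b (assumption)."""
    def decide(x):
        return a if abs(x - centers[a]) <= abs(x - centers[b]) else b
    return decide


classes = [0, 1, 2, 3, 4]                      # n = 5 classes, as in the paper
centers = {k: float(k) for k in classes}       # invented demo data
pairwise = {(a, b): nearest_center_rule(a, b, centers)
            for a, b in combinations(classes, 2)}

print(len(pairwise))                           # n(n - 1)/2 binary classifiers
print(one_vs_one_predict(2.2, classes, pairwise))
```

With five classes, `combinations` yields exactly the ten class pairs mentioned above, and each test sample is passed through all ten binary classifiers.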

#### 4.1.4. Samples Sets of Experiment

The surface defect samples of the plates used in this paper were collected from a surface defect detection system in a steel company. The type composition of the sample sets is shown in Table 1.

**Table 1.** Type composition of surface defect sample sets.

#### 4.1.5. Parameter Selection of MB-LBP

There is a certain fluctuation in the recognition accuracy, as shown in Figure 15. When the block size increased from 2 × 2 to 4 × 4, the accuracy increased, and when it increased to 8 × 8, it decreased. This is because the block enlargement was equivalent to image compression, and too large a block size lost too much defect feature information, so the accuracy decreased. Through analysis, this experiment selected 4 × 4 block size for calculation.

**Figure 15.** Variation of time consumption and recognition accuracy with a change of block size n.

#### 4.1.6. Hardware Configuration of Experimental Platform

All algorithms were tested on the same platform. The hardware configuration was as follows: CPU Intel i5-490, main frequency 3.3 GHz, memory 8 GB.

#### *4.2. Comparison of Experiment Result between MB-LBP and Other Algorithms*

To validate the performance of the MB-LBP, it was compared with several other algorithms in terms of classification accuracy and time efficiency. The classification accuracy is the percentage of correctly identified defects among all defects. The time efficiency is the time required to process an image of the same size on the same platform.

In this paper, the scale invariant feature transform (SIFT), the speeded up robust feature (SURF), the gray-level co-occurrence matrix (GLCM), and convolutional auto encoder-semi-supervised generative adversarial networks (CAE-SGAN) were selected for the comparison. The SIFT [27] algorithm is widely used in the fields of image matching, image recognition, and defect detection. It constructs a Gaussian scale space to describe the local features for the purpose of scale invariability, rotation invariability, and illumination invariability. This algorithm is reasonably robust to noise and affine transformations. The SURF [28] algorithm is an improvement on SIFT that does not compute the Gaussian scale space. While retaining the main information of the feature points, SURF employs the box filtering technique to approximate the Gaussian scale space and thereby improve the feature extraction speed. As a classic texture feature descriptor, GLCM [29] reflects the comprehensive information of a grayscale image with respect to direction, adjacent interval, and transform amplitude, and is the basis for analyzing the local pattern structure and arrangement rules of the images. The dimension of the feature vector extracted by GLCM is 256 × 256, hence the computation amount is too large. To address this problem, the KLPP [30] was adopted to reduce the dimensions of the feature vector. CAE-SGAN [9] is a deep learning method based on a convolutional auto encoder (CAE) and semi-supervised generative adversarial networks (SGAN). It first trains a stacked CAE on massive unlabeled data. After the CAE is trained, its encoder network is retained as the feature extractor and is fed into a softmax layer to form a new classifier. SGAN is introduced for semi-supervised learning to further improve the generalization ability of the new method. Like other deep learning methods, this method uses the original defect images as the network input.

Because the surface defect detection system for steel plates is an online real-time system, time efficiency is its key indicator. In our experience, an online detection system must process each image within 100 ms to meet the real-time requirement. Subject to this limit, the recognition accuracy should be as high as possible; for practical use, it should be at least 90%.

Table 2 compares the recognition accuracy and time efficiency of the MB-LBP with those of the other algorithms. SIFT, SURF, and GLCM required 583 ms, 142 ms, and 248 ms per image, respectively, which is too slow for online real-time detection. LBP required 41 ms, which meets the real-time requirement and makes it the fastest algorithm. However, its recognition accuracy was only 89.05%, below the minimum practical requirement. MB-LBP required 63 ms. Although slightly slower than LBP, it still meets the real-time requirement, and its recognition accuracy of 94.40% was higher than that of all other algorithms except CAE-SGAN. CAE-SGAN, a deep learning method, had the highest recognition accuracy overall, 96.21%. However, it uses the original image as input, so the dimension of the input vector is as high as 224 × 224, and its network structure is complicated. As a result, it required 434 ms/image, far above the 100 ms/image limit for online real-time detection in this paper.
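The selection logic described above can be written as a simple filter. The sketch below uses only the figures quoted in this paragraph; accuracies that the paragraph does not quote are left as `None`, and the variable and function names are illustrative.

```python
MAX_TIME_MS = 100.0   # real-time limit stated in the text
MIN_ACCURACY = 90.0   # practical accuracy floor stated in the text

# (time in ms/image, accuracy in %) as quoted in the paragraph above;
# None marks accuracies that the paragraph does not quote.
results = {
    "SIFT": (583, None),
    "SURF": (142, None),
    "GLCM": (248, None),
    "LBP": (41, 89.05),
    "MB-LBP": (63, 94.40),
    "CAE-SGAN": (434, 96.21),
}


def meets_requirements(time_ms, accuracy):
    """True only if both the time limit and the accuracy floor are satisfied."""
    return (
        time_ms <= MAX_TIME_MS
        and accuracy is not None
        and accuracy >= MIN_ACCURACY
    )


qualified = [name for name, (t, acc) in results.items() if meets_requirements(t, acc)]
print(qualified)  # ['MB-LBP']
```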


**Table 2.** Comparison of the recognition accuracy and time efficiency between the MB-LBP and other algorithms.

Table 3 shows the recognition accuracy of each class of defects with the MB-LBP. The recognition accuracy of all defects met the requirement of online detection (at least 90%). The recognition accuracy of most classes was above 94%, except for scales. This is because scales have complex morphological features and are difficult to classify. Recognition of scales is also one of the future improvement directions.


**Table 3.** Recognition accuracy of each class of defects with the MB-LBP.


Figure 16 shows some examples of false matching. Cracks vary in size and shape; in particular, small cracks are difficult to detect. Figure 16(a1,a2) show false-matching cracks that were too small and not obvious enough. Scratches were relatively easy to detect, but were greatly affected by light. Figure 16(b1,b2) show false-matching scratches. Figure 16(b1) was misclassified as a crack because it shows a thin dark stripe under weak light, and Figure 16(b2) was misclassified as a scale because it shows an irregular area under strong light. Indentations vary in size and direction. Figure 16(c1,c2) show false-matching indentations: two indentations with different sizes and directions were classified as different types of defects. Pits have irregularly shaped edges and complex backgrounds, like scales. Figure 16(d1,d2) show false-matching pits that were misclassified as scales. Scales have complex shapes and large grayscale changes, and are therefore the main source of misclassification. Scales have the lowest recognition accuracy: many other types of defects are misclassified as scales, and vice versa. Figure 16(e1,e2) show false-matching scales, which were misclassified as a pit and a crack, respectively, because of their similar shapes.

**Figure 16.** Examples of misclassified defects of cracks, scratches, indentations, pits, and scales. Panels (**e1**) and (**e2**) show false-matching scales.


#### **5. Conclusions**

Online detection of the surface defects of steel plates requires an algorithm that is both fast and accurate.

In this paper, we proposed a surface defect detection algorithm for steel plates, which adopted the MB-LBP algorithm to extract the defect features. The MB-LBP algorithm divides the image into small blocks. After the binarization and MB-LBP value calculation, grayscale histograms were generated as the feature vectors of the image.

To verify the performance of the MB-LBP algorithm, a comparison with several other algorithms was made. The experimental results show that the recognition accuracy of the MB-LBP algorithm was higher than that of all the compared algorithms except CAE-SGAN, which is too slow for online use, and that its time efficiency was sufficient to meet the real-time requirements of the online surface defect detection of steel plates.

**Author Contributions:** Conceptualization, Y.L.; Data curation, Y.L.; Funding acquisition, K.X.; Investigation, Y.L.; Methodology, Y.L.; Project administration, J.X.; Resources, K.X.; Supervision, K.X.; Validation, J.X.; Writing – original draft, Y.L.; Writing – review & editing, K.X.

**Funding:** This research was funded by the National Natural Science Foundation of China, grant numbers 51674031 and 51874022.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**



© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Measurement of Period Length and Skew Angle Patterns of Textile Cutting Pieces Based on Faster R-CNN**

## **Lei Geng 1,2 , Qinglei Meng 1,2, Zhitao Xiao 1,2,\* and Yanbei Liu 1,2**


Received: 18 June 2019; Accepted: 25 July 2019; Published: 26 July 2019

**Abstract:** The skew angle and period length of the multi-period pattern are two critical parameters for evaluating the quality of textile cutting pieces. In this paper, a new method for measuring the skew angle and period length is proposed based on the Faster region convolutional neural network (R-CNN). First, a dataset containing approximately 5000 unique pattern images was established and annotated in the format of PASCAL VOC 2007. Second, the Faster R-CNN model was used to detect the pattern and determine its approximate location (the position of the whole pattern). Third, the precise position of the pattern (the geometric center points of the pattern) was obtained from the approximate position using the automatic threshold segmentation method. Finally, the four-neighbor method was used to fill in the missing center points to obtain a complete center point map, from which the skew angle and period length were measured. The experimental results show that the mean average precision (mAP) of the pattern detection reached 84%, and that the average error of the proposed algorithm was less than 5% compared with manual measurement.

**Keywords:** faster R-CNN; cutting pieces; multi-period pattern; skew angle; period length

## **1. Introduction**

Textile cutting pieces [1] are semi-finished products that are widely used in car seats and garments. Most finished products are stitched from these pieces, and the performance of the pieces (see Figure 1a) is key to the quality of the finished product. The quality of a textile piece depends largely on its preformed geometric structure, such as the period length and skew angle of the pattern. The skew angles θ*weft* and θ*warp* are defined as the angles between the line along the horizontal or vertical period direction of the pattern and the overall contour of the piece (see Figure 1b). The skew angle is a critical parameter of multi-period pattern pieces because it affects the overall regularity of the pattern. The period lengths *Thi* and *Tvi* (*i* = 1, 2 . . . ) are defined as the lengths of one complete pattern period in the horizontal and vertical directions, respectively, and can be used to infer the local regularity of the pattern (see Figure 1b). These two parameters reflect the design difference between the pattern sample and the standard template, so they can be used as criteria for judging the quality of the pattern.

**Figure 1.** Textile cutting piece of car seat and pattern parameters. (**a**) The overall outline of the cutting piece with strip-shaped pattern in the global perspective. (**b**) The local part of (a) where *T<sup>h</sup>* is weft period length, *T<sup>v</sup>* is warp period length. θ*weft* is weft skew angle, and θ*warp* is warp skew angle.

In industry, these two parameters are used to check whether cutting pieces are qualified. At present, measurement is still mainly manual, which is time-consuming and labor-intensive, and because of the large number of pattern types, only a limited number of cutting pieces are sampled. In addition, a cutting piece with complicated patterns cannot be inspected effectively by the human eye, which often causes quality problems.

The periodicity of the pattern is of great research significance for patterned fabric and is the basis for measuring the skew angle and period length of a multi-period pattern. Period extraction is therefore the key difficulty in measuring the parameters of a pattern. In recent years, with the rapid development of image processing technology, many approaches have been proposed for studying fabric periodicity. In general, these approaches can be classified into three groups: those based on the grey level co-occurrence matrix (GLCM) [2], those based on the distance matching function (DMF) [3], and those based on the image autocorrelation function [4].

The GLCM is a common technique in statistical image analysis that is used to estimate image properties related to second-order statistics. Li [5] and his colleagues studied the variation of the eigenvalues of four grey level co-occurrence matrices to determine the period characteristics of texture, and achieved relatively good results. Xiao, et al. [6] calculated the correlation coefficient between different regions enclosed by fabric yarns based on the grey level co-occurrence matrix method to complete the segmentation of a striped fabric. The features calculated from the co-occurrence matrix can be used for period detection in finite-size pattern images, and the computation is relatively fast. However, because the angle and distance are usually quantized to reduce computation time when calculating the co-occurrence matrix features, the accuracy of texture period extraction is significantly reduced.

The DMF-based method uses the grey values of the texture directly to find the texture period and requires less computation time than the traditional co-occurrence matrix approach. Jing [7] determined the period of printed fabrics by calculating the maximum value of the second forward difference of the two-dimensional DMF. Zhou [8] implemented an automatic measurement of the texture period of woven fabric images by combining frequency domain analysis with a distance matching function, improving the stability and computational efficiency of the period measurement. The distance matching function is an effective method for extracting the pattern period: for images of any size, it is faster to compute than the traditional co-occurrence matrix method, and it is suitable for patterns of finite size. However, when the brightness and shape of the periodic pattern are inconsistent, the distance matching function cannot effectively extract the pattern period.

The autocorrelation-based method uses the autocorrelation function of the image to calculate the correlation coefficient of the texture and thereby analyze its periodicity. Wu [9] calculated the autocorrelation function of the texture edge to determine the matrix of the autocorrelation function, and then extracted the periodic and directional features of the texture. The image autocorrelation function method is easy to implement and has strong adaptability. However, it reflects only the periodic features of the pattern and no other features. Moreover, it cannot efficiently acquire the period information of a periodic pattern that is large and sparsely distributed in the image.

Similarly, the surface braiding angle and pitch length of three-dimensional braided composites have been measured by corner detection-based methods [10–12]. However, corner detection is not suitable for measuring multi-period pattern parameters. For complicated patterns, the detected corner points are disordered and contain too many pseudo corner points, so the center point of the pattern cannot be located accurately and the pattern period cannot be effectively extracted.

In this paper, a new measurement method based on the Faster region convolutional neural network (R-CNN) [13] is proposed to measure the parameters of the multi-period pattern. Faster R-CNN has been applied in many fields, such as license plate detection [14], scene text detection [15] and optical image detection [16], and has achieved excellent results. Because Faster R-CNN offers high object detection accuracy, fast speed and strong adaptability, we used it as the pattern detector to locate the pattern and extract the pattern period. The contributions of this paper are as follows:


#### **2. Methods**

In this section, the period length and pattern skew angles of textile cutting pieces are measured based on Faster R-CNN. First, original images were acquired and the pattern dataset was created. Second, the patterns were detected by a model trained with the Faster R-CNN network, which gave the approximate location of each pattern (the position of the whole pattern). Third, the precise positions of the patterns (the geometric center points of the patterns) were detected from the approximate positions using the automatic threshold segmentation method. Fourth, missing center points were filled in using the four-neighbor method to obtain a complete center point map. Finally, the skew angle and period length were measured from the detected center points.

#### *2.1. Image Acquisition and Pattern Dataset Creation*

In this study, the image acquisition system was composed of a dome light source, a 1.3 megapixel color industrial camera, an LCD backlight and a servo motor module (see Figure 2). The industrial camera, fitted with a lens, was installed vertically and looked down on the fabric. The dome light source, placed in front of the fabric, illuminated the fabric surface uniformly. In order to sample multiple parts of the piece, the system contained a servo mobile module that could move the industrial camera within a plane. Figure 3 shows the six types of pattern images *Fi*, *Fr*, *Fc*, *Fs1*, *Fs2* and *Fs3* (each 1024 × 1280 pixels) acquired by the image acquisition system, where *Fi* is the pattern with an irregular shape, *Fr* is the pattern with a circular shape, *Fc* is the pattern with a wavy shape, and *Fs1*, *Fs2* and *Fs3* are patterns with a strip shape. The acquired images were transmitted to the data processing system (see Figure 2) to be processed by the Faster R-CNN based algorithm. The image data processing is described in Section 2.3.

**Figure 2.** Image processing system. The image acquisition system on the left was used to acquire the partial texture image of the pattern, and the acquired images were three-channel RGB images and the size was 1024 × 1280 pixels. The acquired images were transmitted to the data processing system on the right for data processing.

**Figure 3.** The sample images of different kinds of pattern piece. (**a**) irregular-shaped pattern *Fi*, (**b**) circular-shaped pattern *Fr*, (**c**) wavy-shaped pattern *Fc*, (**d**) strip-shaped pattern *Fs1*, (**e**) strip-shaped pattern *Fs2*, (**f**) strip-shaped pattern *Fs3*.

The pattern dataset contained 5000 unique images of 640 × 512 × 3 pixels, each containing approximately 20 to 50 patterns. These images were cropped from approximately 400 original images of 1280 × 1024 × 3 pixels. To obtain the best training effect, each image was labeled in detail. The key points for labeling each image were as follows:


• The bounding boxes do not overlap, and the dimensions of each bounding box remain constant.

#### *2.2. Training of Pattern Detection Model*

In recent years, object detection technology has developed rapidly, and object detection networks based on deep learning have greatly improved detection ability. At present, there are two main families of methods: one depends on region proposals, such as R-CNN (region convolutional neural network) [17], Fast R-CNN [18], Faster R-CNN [13] and R-FCN [19]; the other does not rely on region proposals and predicts object classes and locations directly, such as SSD [20] and the YOLO family [21–23].

After R-CNN [17] and Fast R-CNN [18], Shaoqing Ren and colleagues at Microsoft proposed Faster R-CNN [13] to reduce the running time of the detection network. It introduces the region proposal network (RPN) to generate proposal regions. The RPN replaces earlier methods such as Selective Search [24] and EdgeBoxes [25], and it shares the full-image convolutional features with the detection network, so that region proposal takes very little time. The Faster R-CNN framework consists of the RPN plus Fast R-CNN: the RPN generates high-quality proposal region boxes, and Fast R-CNN learns the features of the proposed regions and classifies the objects. The overall framework of Faster R-CNN is shown in Figure 4.

**Figure 4.** Faster region convolutional neural network (R-CNN) overall framework. (**a**) The streamlined flow chart of the Faster R-CNN framework [26]. (**b**) The detailed flow chart of the Faster R-CNN [13].

Faster R-CNN introduces the region proposal network and thereby improves the efficiency of object detection, which makes it feasible to detect multi-period patterns with Faster R-CNN. Three networks (ZF-Net [27], VGG16 [28] and ResNet-101 [29]) were each used as the pre-trained model of Faster R-CNN. The pattern dataset contained 5382 patterns of six types: *Fi*, *Fr*, *Fc*, *Fs1*, *Fs2* and *Fs3* (see Figure 3). The numbers of images of each type were 894, 899, 902, 892, 904 and 891, respectively. The image size was 512 × 640 pixels. The pattern dataset was randomly divided into a validation set, a test set and a training set in the ratio 2:2:6. The divided datasets were then used to train Faster R-CNN (ZF-Net), Faster R-CNN (VGG16) and Faster R-CNN (ResNet-101), respectively. The experimental platform included Windows 7, a GTX1080ti GPU, Matlab 2014a and Visual Studio 2013, and the whole experiment was based on the deep learning framework Caffe.
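As a rough illustration of the 2:2:6 random split, the sketch below shuffles image indices and cuts them by ratio. The per-type counts are those quoted above; the function name, the fixed seed and the use of integer division are assumptions for illustration only.

```python
import random


def split_indices(n_items, ratios=(2, 2, 6), seed=0):
    """Shuffle indices and cut them into validation/test/train parts by ratio."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_val = n_items * ratios[0] // total
    n_test = n_items * ratios[1] // total
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]   # remainder goes to the training set
    return val, test, train


class_counts = [894, 899, 902, 892, 904, 891]   # images per pattern type
val, test, train = split_indices(sum(class_counts))
print(len(val), len(test), len(train))  # 1076 1076 3230
```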

Table 1 compares the performance of the three pattern detection models. Precision represents the detection accuracy of the pattern. Balanced accuracy (Ba), the mean of the per-class recalls, was used to evaluate accuracy across the classes of the pattern dataset. The kappa coefficient (K) was used to evaluate the accuracy of the pattern classification. Mean average precision (mAP) was the main indicator for evaluating the detection results, because it is the standard metric for object detection.
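For reference, balanced accuracy and the kappa coefficient can both be computed from a confusion matrix. The sketch below assumes that K refers to Cohen's kappa, which the text does not state explicitly, and uses a small made-up two-class matrix purely for illustration.

```python
def balanced_accuracy(cm):
    """Mean of per-class recall; cm[i][j] counts true class i predicted as j."""
    recalls = [row[i] / sum(row) for i, row in enumerate(cm) if sum(row) > 0]
    return sum(recalls) / len(recalls)


def cohen_kappa(cm):
    """Agreement beyond chance; undefined when expected agreement equals 1."""
    n = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(len(cm))) / n
    expected = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(len(cm))
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix (not data from the paper).
toy = [[8, 2],
       [1, 9]]
ba = balanced_accuracy(toy)   # (0.8 + 0.9) / 2
kappa = cohen_kappa(toy)      # (0.85 - 0.5) / (1 - 0.5)
```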


**Table 1.** Evaluation of the Faster R-CNN with different pre-trained nets.

From Table 1, it can be concluded that the precision, K, Ba and mAP obtained with ResNet-101 as the pre-trained model were higher than those of the other two networks; ResNet-101 was therefore chosen as the pre-trained model in this paper.

Figure 5 shows the detection results of the six patterns (*Fi*, *Fr*, *Fc*, *Fs1*, *Fs2* and *Fs3*) with the Faster R-CNN model. Boxes of different colors represent the different pattern categories detected by the model; the label at the upper left corner of each box gives the classification result, that is, the pattern category and category score of the object in the box region. Since the patterns *Fs1*, *Fs2*, *Fs3* and *Fr* had a large pattern pitch, small size and regular shape, we chose to surround the whole pattern with the bounding box. The patterns *Fi* and *Fc* had irregular shapes and could not be completely surrounded by a bounding box, so we took a part of the pattern with periodic characteristics as the detection object. Figure 5 indicates that the Faster R-CNN model could effectively detect the six types of multi-period patterns with few false positives and missed detections.

Figure 6 shows the precision–recall (P–R) curve of the Faster R-CNN pattern detection model. The vertical axis is the precision and the horizontal axis is the recall. The area enclosed by the curve represents the mAP. Figure 6 indicates that the pattern detection model achieved high precision, recall and average precision, and therefore had good pattern detection ability.




**Figure 5.** Detection results of the six types of patterns on the Faster R-CNN model. (**a**) Detection result of *F<sup>i</sup>* . (**b**) Detection result of *Fr*. (**c**) Detection result of pattern *Fc*. (**d**) Detection result of *Fs1*. (**e**) Detection result of *Fs2*. (**f**) Detection result of *Fs3*.

**Figure 6.** Precision–recall (P–R) curve of pattern detection by Faster R-CNN. The longitudinal axis indicates the detection precision, and the horizontal axis indicates the recall ratio. The area enclosed by the curve represents the mean average precision of the pattern detection.
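The relationship between the P–R curve and mAP can be sketched as follows. This is a minimal illustration that assumes a plain trapezoidal area under each per-class curve; benchmark toolkits often use interpolated variants, and the text does not say which one was used here. The curve points are made up.

```python
def area_under_pr(recalls, precisions):
    """Trapezoidal area under a precision-recall curve given matching lists."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_average_precision(per_class_curves):
    """Mean of the per-class areas; each curve is a (recalls, precisions) pair."""
    aps = [area_under_pr(r, p) for r, p in per_class_curves]
    return sum(aps) / len(aps)


# Hypothetical curves for two classes (not data from the paper).
curves = [
    ([0.0, 1.0], [1.0, 1.0]),
    ([0.0, 0.5, 1.0], [1.0, 1.0, 0.0]),
]
m_ap = mean_average_precision(curves)
```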

#### *2.3. Centre Point Extraction*

The center point of the pattern is defined as the center of the region enclosed by the pattern outline. The area of a region is defined as its number of pixels, and the center is calculated as the mean of the row coordinates and the mean of the column coordinates of all pixels in the region. The proposed method detected the approximate position of the pattern using Faster R-CNN, and then used an automatic threshold method to segment the pixels of the pattern region and calculate the center coordinates. The steps are as follows:

**Step 1**: The image is cropped with overlapping areas. The original image (Figure 7a), with a size of 1280 × 1024 pixels, is cropped into several sub-images (Figure 7b–e) of 640 × 512 pixels. The moving step length of the cropping is approximately twice the length of the pattern period.

**Step 2**: The Faster R-CNN model is used to detect the pattern and output the classification scores and categories (Figure 7f–i).

**Step 3**: Merge the sub-images *Fsubi* according to the cropping coordinates to obtain the image *Fnew* (see Figure 8b). For example, if the coordinates in a sub-image are (*xsub*, *ysub*) and the corresponding coordinates in the image *Fnew* are (*xori*, *yori*), then (*xori*, *yori*) are computed as follows:

$$(x_{ori}, y_{ori}) = (x_{sub} + s_{x} \times (i - 1),\ y_{sub} + s_{y} \times (j - 1)) \tag{1}$$

where *s<sub>x</sub>* and *s<sub>y</sub>* are the horizontal and longitudinal moving step lengths, respectively, and the variables *i* (*i* = 1, 2, 3 . . . ) and *j* (*j* = 1, 2, 3 . . . ) are the indices of the horizontal and vertical crops, respectively. Then combine the overlapping bounding boxes into one large bounding box according to the maximum coordinates of the overlapping bounding boxes to obtain the image *Fnew1* (see Figure 8c).

**Step 4**: Correct the inaccurate bounding boxes and calculate the center points of the patterns to get the original center point map. First, the grey distribution of the original image is collected in a grey histogram and the average grey value *Gth* is calculated. Second, *Gth* is used as a threshold to segment the patterns in the bounding boxes, and the area *S<sub>i</sub>* (*i* = 1, 2 . . . ) of each segmented pattern, the average value *Sav* of the *S<sub>i</sub>*, the center point *P<sub>p</sub>* of the pattern, and the center point *P<sub>b</sub>* of the bounding box are calculated. Third, compare *S<sub>i</sub>* with *Sav* and remove the bounding boxes for which *S<sub>i</sub>* ≤ *Sav*. Fourth, adjust the positions of the bounding boxes by moving them toward *P<sub>p</sub>* so that *S<sub>i</sub>* > *Sav*. Finally, segment the patterns *f* in the corrected bounding boxes using the threshold segmentation method and calculate their center points to obtain the original center point map (see Figure 9a).

$$f(x, y) = \begin{cases} 1, & f(x, y) > G_{th} \\ 0, & f(x, y) \le G_{th} \end{cases} \tag{2}$$

**Step 5**: Missing center points are filled in using the four-neighbor method to obtain the final center point map (see Figure 9c). First, the approximate weft period length *Thx* and the approximate warp period length *Thy* are estimated from the original center point map. Second, suppose a missing center point lies between two known adjacent points *A* and *B*. If (*k* + 1/2) *Thx* < *d<sub>m</sub>* < (*k* + 3/2) *Thx*, where *k* = 1, 2 . . . , and *d<sub>m</sub>* is the distance between the two adjacent points *A* and *B*, then *k* missing center points are filled in uniformly on the line *AB*. Let *N* be one such filled point. Third, find the two points *C* and *D* adjacent to *N* in the longitudinal direction. Finally, the missing point *M* (see Figure 9b) is the intersection of *L*<sub>1</sub> (the line through *A* and *B*) and *L*<sub>2</sub> (the line through *C* and *D*). Missing points in the vertical direction are handled in the same way.
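The thresholding of Step 4 (Equation (2)) and the center definition from the start of this section can be sketched as below. The sketch assumes a grey image stored as a list of rows, an end-exclusive box (x0, y0, x1, y1), and pattern pixels brighter than the threshold, as Equation (2) implies. The function names are illustrative, and the box removal and adjustment logic of Step 4 is not reproduced.

```python
def mean_grey(img):
    """Average grey value of the whole image, used as the threshold G_th."""
    values = [v for row in img for v in row]
    return sum(values) / len(values)


def segment_and_centroid(img, box, threshold):
    """Binarise pixels inside box and return (area, (mean x, mean y))."""
    x0, y0, x1, y1 = box
    xs, ys = [], []
    for y in range(y0, y1):
        for x in range(x0, x1):
            if img[y][x] > threshold:   # Equation (2): 1 if f(x, y) > G_th
                xs.append(x)
                ys.append(y)
    area = len(xs)                      # area = number of pattern pixels
    if area == 0:
        return 0, None
    return area, (sum(xs) / area, sum(ys) / area)


# Toy 4 x 4 image with a bright 2 x 2 pattern in the middle.
img = [[0, 0, 0, 0],
       [0, 200, 200, 0],
       [0, 200, 200, 0],
       [0, 0, 0, 0]]
g_th = mean_grey(img)
area, centre = segment_and_centroid(img, (0, 0, 4, 4), g_th)
```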

**Figure 7.** Intermediate process of the proposed method. (**a**) Original image. (**b**) Sub-image *Fsub1*. (**c**) Sub-image *Fsub2*. (**d**) Sub-image *Fsub3*. (**e**) Sub-image *Fsub4*. (**f**) Detection results of *Fsub1*. (**g**) Detection results of *Fsub2*. (**h**) Detection results of *Fsub3*. (**i**) Detection results of *Fsub4*.
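The overlapped cropping of Step 1 and the coordinate mapping of Equation (1) can be sketched as follows. The step lengths of 320 and 256 pixels are placeholder assumptions (the text only says the step is about twice the pattern period), and the function names are illustrative.

```python
def crop_origins(full, window, step):
    """Start offsets of overlapping windows along one axis on a regular grid."""
    return list(range(0, full - window + 1, step))


def to_original(x_sub, y_sub, i, j, s_x, s_y):
    """Equation (1): map a point in sub-image (i, j), 1-based, to the full image."""
    return x_sub + s_x * (i - 1), y_sub + s_y * (j - 1)


# Assumed step lengths; the real values depend on the pattern period.
S_X, S_Y = 320, 256
x_starts = crop_origins(1280, 640, S_X)   # [0, 320, 640]
y_starts = crop_origins(1024, 512, S_Y)   # [0, 256, 512]

# A detection at (10, 20) in the sub-image with i = 2, j = 3.
x_ori, y_ori = to_original(10, 20, 2, 3, S_X, S_Y)
```

Keeping the crop origins on a regular grid is what makes Equation (1) valid: the origin of sub-image (i, j) is then exactly (s_x (i − 1), s_y (j − 1)).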

**Figure 8.** Bounding box mapping process. (**a**) Original image. (**b**) Bounding box mapping result *Fnew*. (**c**) Bounding box merge result *Fnew1*.
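The merging of overlapping bounding boxes in Step 3 can be sketched as repeatedly replacing any overlapping pair with its enclosing box. This is one plausible reading of "according to the maximum coordinates"; boxes are assumed to be (x0, y0, x1, y1) tuples, and the quadratic search is chosen for clarity rather than speed.

```python
def overlaps(a, b):
    """True if boxes (x0, y0, x1, y1) share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def enclosing(a, b):
    """Smallest box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_overlapping(boxes):
    """Merge overlapping pairs until no two remaining boxes overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    merged[i] = enclosing(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break   # restart the scan after every merge
    return merged
```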

**Figure 9.** The procedure of the proposed method. (**a**) Original center point map. (**b**) Missing point filling process. (**c**) Final center point map.
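The counting rule of Step 5 can be sketched as below: the number of missing points between two neighbors is derived from their distance and the approximate period, and the points are then placed uniformly on the segment. The sketch treats the lower boundary as inclusive (the text uses strict inequalities), and it omits the refinement by line intersection (*L*<sub>1</sub> with *L*<sub>2</sub>). The function names are illustrative.

```python
import math


def missing_count(distance, period):
    """k >= 0 with (k + 1/2) * period <= distance < (k + 3/2) * period."""
    return max(math.floor(distance / period - 0.5), 0)


def fill_between(a, b, period):
    """Evenly spaced points strictly between a and b, one per missing period."""
    d = math.hypot(b[0] - a[0], b[1] - a[1])
    k = missing_count(d, period)
    return [
        (a[0] + (b[0] - a[0]) * t / (k + 1), a[1] + (b[1] - a[1]) * t / (k + 1))
        for t in range(1, k + 1)
    ]


# Two detected centres 30 px apart with a period of about 10 px: two are missing.
filled = fill_between((0, 0), (30, 0), 10)
```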

### *2.4. Pattern Period Length and Skew Angle Measurement*

The skew angles and period length can be measured based on the final center point map which reflects the center points distribution of the pattern.

The period lengths include the weft period length *T<sup>h</sup>* and warp period length *Tv*. As shown in Figure 10b, the detected center points were denoted by the red points *H<sup>i</sup>* , *V<sup>i</sup>* , (*i* = 1, 2, 3, . . . ). One weft period length is *T<sup>h</sup>* = *dHiHj*, where *dHiHj* is the distance between the *H<sup>i</sup>* and *H<sup>j</sup>* (*j* = *i* + 1, *i* = 1, 2, 3 . . . ). Similarly, one warp period length is *T<sup>v</sup>* = *dViVj*, where *dViVj* is the distance between the *V<sup>i</sup>* and *V<sup>j</sup>* (*j* = *i* + 1, *i* = 1, 2, 3 . . . ).

**Figure 10.** Measurement of the skew angles and period length. (**a**) Schematic diagram of pattern parameter measurement. (**b**) The pattern center points map of (**a**).

The skew angles include the weft skew angle θ*weft* and the warp skew angle θ*warp*. θ*weft* and θ*warp* are calculated from θ*h*, θ*<sup>v</sup>* and θ*<sup>c</sup>* in Figure 10a, where θ*<sup>c</sup>* is the angle between the contour of the piece and the x-axis, measured by the measuring tool (see Figure 10a). Since this paper only studied the local pattern features of the piece, it is assumed that θ*<sup>c</sup>* is a known angle. The way to obtain θ*<sup>h</sup>* and θ*<sup>v</sup>* is shown in Figure 10b. First, the least squares method is used to fit the center points *H<sup>i</sup>* (*i* = 1, 2, 3 . . . ) in the weft direction as a weft period line *Lh*. Second, calculate the slope *K<sup>h</sup>* of the line *Lh*. Finally, θ*<sup>h</sup>* is calculated as θ*<sup>h</sup>* = arctan(*K<sup>h</sup>*). Similarly, the center points *V<sup>i</sup>* (*i* = 1, 2, 3 . . . ) in the warp direction are used to obtain θ*v*. The skew angles (θ*weft* and θ*warp*) can be obtained by the following equation:

$$(\theta_{\text{weft}}, \theta_{\text{warp}}) = (\theta_{h} - \theta_{c}, \theta_{v} - \theta_{c}) \tag{3}$$
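As a rough illustration of this subsection, the sketch below computes a period length as the mean distance between consecutive center points and a line angle from a least-squares fit, then applies Equation (3). The function names, the sample points and the value of θ*<sup>c</sup>* are assumptions made for the example. Regressing *x* on *y* for near-vertical (warp) point sets is also an assumption, added to avoid an unbounded slope; the text only states that the slope is converted with arctan.

```python
import math


def period_length(points):
    """Mean distance between consecutive center points in one period direction."""
    dists = [math.dist(p, q) for p, q in zip(points, points[1:])]
    return sum(dists) / len(dists)


def fit_angle_deg(points, near_vertical=False):
    """Angle (degrees) between the least-squares line through points and the x-axis.

    For near-vertical point sets x is regressed on y instead of y on x.
    """
    if near_vertical:
        points = [(y, x) for x, y in points]
    n = len(points)
    mean_u = sum(u for u, _ in points) / n
    mean_w = sum(w for _, w in points) / n
    s_uu = sum((u - mean_u) ** 2 for u, _ in points)
    s_uw = sum((u - mean_u) * (w - mean_w) for u, w in points)
    slope = s_uw / s_uu
    if near_vertical:
        # Line x = slope * y + b has direction (slope, 1).
        return math.degrees(math.atan2(1.0, slope))
    return math.degrees(math.atan(slope))  # theta = arctan(K)


# Illustrative center points (not data from the paper).
H = [(0, 0), (10, 1), (20, 2), (30, 3)]  # weft direction
V = [(0, 0), (0, 10), (0, 20)]           # warp direction
theta_c = 2.0                            # assumed known contour angle, degrees

T_h = period_length(H)
theta_h = fit_angle_deg(H)
theta_v = fit_angle_deg(V, near_vertical=True)
theta_weft, theta_warp = theta_h - theta_c, theta_v - theta_c  # Equation (3)
```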

#### **3. Results and Discussion**

In this section, the proposed algorithm was used to test the six types of pattern images (*F<sup>i</sup>* , *F<sup>r</sup>* , *Fc*, *Fs1*, *Fs2* and *Fs3*) shown in Figure 11a–f. The center point maps reflecting the periodic characteristics of the pattern are shown in Figure 11g–l. Since θ*<sup>c</sup>* in Figure 10a is an external angle and does not affect the overall accuracy of the angle to be measured, this paper measured and evaluated the θ*<sup>h</sup>* and θ*<sup>v</sup>* shown in Figure 10a.

To evaluate the proposed algorithm, we compared its results with manual measurement results. Manual measurement of the period length was achieved by clicking the center points of the patterns on the computer screen. As illustrated in Figure 12b, for example, we obtained the center points *H<sup>i</sup>* , *V<sup>i</sup>* , (*i* = 1, 2, . . . ) and then calculated the weft period length as *T<sup>h</sup>* = *dHiHj*, where *dHiHj* is the distance between the *H<sup>i</sup>* and *H<sup>j</sup>* (*j* = *i* + 1, *i* = 1, 2, . . . ), and the warp period length as *T<sup>v</sup>* = *dViVj*, where *dViVj* is the distance between the *V<sup>i</sup>* and *V<sup>j</sup>* (*j* = *i* + 1, *i* = 1, 2, . . . ). *T<sup>h</sup>* and *T<sup>v</sup>* were measured twenty times, and the final result was the average of the measurements. Similarly, manual measurement of θ*<sup>h</sup>* and θ*<sup>v</sup>* was also accomplished by clicking the center points of the patterns on the computer screen. The center points in the same period direction were fitted as a straight line and the angle between the line and the x-axis was calculated (see Figure 11a). We measured each angle twenty times and then calculated the average values to obtain the result. The standard deviation δ was used to analyze the accuracy of the manual measurements of the period length and skew angle.

**Figure 12.** Manual measurement of parameters. (**a**) Skew angle, (**b**) period length.

The expression of the standard deviation is shown as follows:

$$\delta = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( X_i - \bar{X} \right)^2} \tag{4}$$

where $\bar{X}$ is the average value of the measurements *X<sub>i</sub>* and *N* is the number of measurements.

Table 2 shows the manual measurement standard deviation of period length of images (*F<sup>r</sup>* , *Fc*, *Fs1*, *Fs2* and *Fs3*), where δ*Th*α, δ*Th*β, and δ*Th*<sup>γ</sup> were the minimum, maximum and average standard deviation, respectively, of the weft period length measurement. δ*Tv*α, δ*Tv*β, and δ*Tv*<sup>γ</sup> were defined similarly for the warp period length measurement. As shown in Table 2, the standard deviation obtained by manual measurement was very small. Therefore, it was reasonable to use the manual measurement results as the evaluation standard.


**Table 2.** Standard deviation of manual measurement of period length.

Table 3 shows the period length measurement results of images (*F<sup>i</sup>* , *F<sup>r</sup>* , *Fc*, *Fs1*, *Fs2* and *Fs3* in Figure 11). The method based on image autocorrelation, the method based on distance matching function and the proposed method were compared with the manual measurement results, where *T* is the period length, the subscript *h* represents the weft direction, the subscript *v* represents the warp direction, *m* stands for manual measurement method, *p* stands for the proposed method, *z* stands for autocorrelation, *d* stands for distance matching function.


**Table 3.** Period length measurement results of various methods.

Table 4 shows the relative error value between different period length measurement methods and manual measurements of images (*F<sup>i</sup>* , *F<sup>r</sup>* , *Fc*, *Fs1*, *Fs2* and *Fs3* in Figure 11), where the *ehpm* represents the relative error of *Thp* with *Thm*. *ehzm* represents the relative error of *Thz* with *Thm*. *ehdm* represents the relative error of *Thd* with *Thm*. Similarly, *evpm* is the relative error of *Tvp* with *Tvm*. *evzm* is the relative error of *Tvz* with *Tvm*. *evdm* is the relative error of *Tvd* with *Tvm*. The expression of the relative error RE is shown in the following Equation (5).

$$RE = \frac{X - T}{T} \times 100\% \tag{5}$$

where *X* represents the measured value and *T* represents the actual value.
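Equations (4) and (5) can be checked with a few lines of Python. The sketch below uses made-up numbers rather than values from the tables; the function names are hypothetical. Note that Equation (4) divides by *N*, so it is the population form of the standard deviation.

```python
import math


def std_dev(values):
    """Equation (4): population standard deviation (divides by N, not N - 1)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


def relative_error(measured, reference):
    """Equation (5): signed relative error in percent."""
    return (measured - reference) / reference * 100.0


# Illustrative repeated measurements (not data from the paper).
repeats = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
delta = std_dev(repeats)
re_percent = relative_error(101.0, 100.0)
```

Because Equation (5) is signed, a negative value indicates that the method underestimates the manual reference.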


**Table 4.** Relative errors of the period length measurements for the multi-period pattern piece.

From Tables 3 and 4, we could conclude that the period length measured by the proposed method had higher accuracy than the autocorrelation-based and the distance-matching function-based method. There was also a smaller relative error between the proposed method and the manual measurement result.

Similar to the evaluation of the period length, the standard deviation was used to suggest the reliability of the manual measurement of the angles. Table 5 shows the manual measurement standard deviation of θ*<sup>h</sup>* and θ*<sup>v</sup>* of images (*F<sup>i</sup>* , *F<sup>r</sup>* , *Fc*, *Fs1*, *Fs2* and *Fs3* in Figure 11), where the δθ*h*α, δθ*h*<sup>β</sup> and δθ*h*<sup>γ</sup> are the minimum, maximum and average standard deviation of the θ*h*, respectively. Similarly, δθ*v*α, δθ*v*<sup>β</sup> and δθ*v*<sup>γ</sup> are the minimum, maximum and average standard deviation of the θ*v*, respectively. From Table 5, we could conclude that the standard deviation of the manual measurements was small. Therefore, it was reasonable to use the manual measurement results as a benchmark to evaluate the measurement accuracy of the proposed measurement method.

**Table 5.** Standard deviation of manual measurement of θ*<sup>h</sup>* and θ*v*.


Table 6 shows the measurement results of θ*<sup>h</sup>* and θ*<sup>v</sup>* obtained by the various methods for images (*F<sup>i</sup>* , *F<sup>r</sup>* , *Fc*, *Fs1*, *Fs2* and *Fs3* in Figure 11). The corner detection-based method and the proposed method for measuring θ*<sup>h</sup>* and θ*<sup>v</sup>* were compared with the manual measurement results, where the subscript *h* represents the weft direction, the subscript *v* represents the warp direction, *m* stands for the manual measurement method, *p* stands for the proposed method, and *c* stands for corner detection.

**Table 6.** Measurement results of various methods of θ*<sup>h</sup>* and θ*v*.


Table 7 shows the relative error value between different angle measurement methods and manual measurements of images (*F<sup>i</sup>* , *F<sup>r</sup>* , *Fc*, *Fs1*, *Fs2* and *Fs3* in Figure 11), where the *ehpm* represents the relative error of θ*hp* with θ*hm*. *evpm* represents the relative error of θ*vp* with θ*vm*. *ehcm* represents the relative error of θ*hc* with θ*hm*. *evcm* represents the relative error of θ*vc* with θ*vm*.


**Table 7.** Relative errors of the angle measurements for the multi-period pattern piece.

The following observations were derived from Tables 6 and 7. The proposed method for measuring θ*<sup>h</sup>* and θ*<sup>v</sup>* achieved a smaller relative error compared to manual measurements. Compared with the corner detection-based method, the proposed method had higher accuracy and more stable performance in angle measurement.

#### **4. Conclusions**

The measurement of the skew angle and the period length is a fundamental problem in the quality inspection of multi-period pattern cutting pieces. We demonstrated a solution in which Faster R-CNN efficiently detects the approximate location of each pattern and a threshold-based method then locates the pattern precisely, which enables the skew angle and period length to be measured with high accuracy. We believe this work opens up research opportunities for using object detection networks to extract the fabric pattern period, providing a new way to study pattern periodicity that can improve the detection accuracy of the pattern parameters.

**Author Contributions:** L.G. and Q.M. wrote the paper; Z.X. and Y.L. gave guidance in experiments and data analysis.

**Funding:** This work was sponsored by the Program for Innovative Research Team in University of Tianjin (No. TD13-5034), the Tianjin Research Program of Application Foundation and Advanced Technology under grant (No. 15JCYBJC16600) and the Textile Industry Association Applied Basic Research Program of China (J201509).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Intelligent Identification of Maceral Components of Coal Based on Image Segmentation and Classification**

**Hongdong Wang <sup>1</sup> , Meng Lei 1,2 , Yilin Chen <sup>3</sup> , Ming Li <sup>1</sup> and Liang Zou 1,2,\***


Received: 19 July 2019; Accepted: 6 August 2019; Published: 8 August 2019

## **Featured Application: Maceral Analysis; Coal Processing.**

**Abstract:** An intelligent analytical technique which is able to accurately identify maceral components is highly desired in the fields of mining and geology. However, currently available methods based on a fixed-size window neglect the shape information, and thus do not work in identifying maceral composition from one entire photomicrograph. To address these concerns, we propose a novel Maceral Identification strategy based on image Segmentation and Classification (MISC). Considering the complex and heterogeneous nature of coal, a two-level coarse-to-fine clustering method based on K-means is employed to divide microscopic images into a sequence of regions with similar attributes (i.e., binder, vitrinite, liptinite and inertinite). Furthermore, comprehensive features along with random forest are utilized to automatically classify binder and seven types of maceral components, including vitrinite, fusinite, semifusinite, cutinite, sporinite, inertodetrinite and micrinite. Evaluations on 39 microscopic images show that the proposed method achieves the state-of-the-art accuracy of 90.44% and serves as the baseline for future research on maceral analysis. In addition, to support the decisions of petrologists during maceral analysis, we developed standalone software, which is freely available at https://github.com/GuyooGu/MISC-Master.

**Keywords:** maceral components; image segmentation; coal petrography; random forest; two-level clustering

## **1. Introduction**

### *1.1. Background and Motivation*

Coal is an extremely complex heterogeneous material formed from ancient wetlands over geological processes. It consists of various organic components called macerals and a lesser amount of inorganic minerals [1,2]. Different from the minerals with homogeneous internal composition and structures, the macerals derived from coalified plant tissues have distinct physical and chemical properties, and are related to the degree of coalification. In addition, the maceral composition is an important factor in evaluating the coal seam quality. Precise identification of the maceral components has a multitude of uses across various industry sectors, including hydrogenation, combustion, carbonization and gasification [2–4].

Macerals can be categorized into three basic groups through petrographic analysis: vitrinite from coalified woody tissue, liptinite from more decay-resistant parts of plants and inertinite from hydrogen-rich plant and decomposition products. These maceral groups are subdivided into maceral subgroups and macerals [5]. Most laboratories associated with the coke-making industry follow standard test methods, such as the International Commission for Coal Petrology (ICCP) standard [6] and the American Society for Testing and Materials (ASTM) standard D2799-13, Standard Test Method for Microscopical Determination of the Maceral Composition [7], and the microscopic analysis is always performed manually for the identification of maceral components. Although manual analysis is the most widely used method, it is costly and labor-intensive because of the complicated nature and substantial diversity of the petrographical properties of coal. Even specialists in petrography may arrive at different judgements when analyzing the same microscopic image as a consequence of subjective factors. To address these concerns, an intelligent analytical technique that can automatically provide objective identification of maceral components is highly desired to meet the growing industrial demand.

#### *1.2. Related Work*

Automatic geological identification is becoming an increasingly important technique in various fields, such as in mining and geology. Camalan et al. presented a novel strategy to estimate the liberation spectrum from optical micrographs via random forest [8]. Lei et al. proposed an autonomous classification method of rock images via unsupervised feature learning [9]. In [10], transfer learning was employed to deal with the problem of cross-region microscopic sandstone images classification. Numerous attempts have been made on the classification of microscopic rock images and achieved great success [11–13]. Considering the heterogeneous nature of coal, automatic identification of maceral components is still a challenging task [3].

In the early stage, the analytical methods to estimate the volume proportions of coal macerals were mainly based on the gray scale value of pixels [14,15] and provided interesting results. However, the liptinite and background resin have similar gray scale values, and therefore it is difficult to separate them. In addition, different maceral components in a maceral group differ only subtly in terms of reflectance, and the existing methods based merely on gray scale values are not suitable for distinguishing maceral components. Furthermore, the gray scale values of a specific component may vary over a large range with the degree of coalification. Although the gray scale descriptions of pixels remain important, the need for more quantitative features, such as shape and texture, from photomicrographs has been recognized. With the development of machine learning and image processing, it is possible to automatically make more elaborate classifications [12,16]. Over the past several years, attempts based on machine learning techniques have shown promising results in maceral analysis.

Wang et al. utilized principal component analysis (PCA) to extract primary features from texture-related and intensity-related features, and employed Support Vector Machine (SVM) to classify maceral components of the vitrinite group [17]. Skiba et al. selected 10 textural features via PCA and developed a novel strategy for automatic identification of macerals of the inertinite group. The proposed method achieved an outstanding accuracy of 93.6% based on a group of neural networks [16]. Most of the works focus on a single maceral group. So far, attempts to provide full identification of macerals have been considerably limited. Młynarczuk and Skiba evaluated the ability of three machine learning methods for identifying three maceral groups of coal and non-organic minerals [3]. They cropped a region of interest (ROI) of 41 px × 41 px and determined the label of the central pixel. Considering the morphological gradients along with the gray level features, the proposed method achieved an accuracy of 97% in classifying maceral groups. Furthermore, they analyzed six kinds of macerals of the inertinite group and obtained an accuracy of over 91%. Pearson and CSIRO have released two automated tools to identify maceral composition, Pearson Coal Petrography (http://www.coalpetrography.com/blog1/) and CSIRO Coal Grain Analysis (CGS) (https://www.csiro.au/en/Do-business/Commercialisation/Marketplace/Coal-Grain-Analysis), which have been successfully commercialized. However, the corresponding websites do not describe the technologies employed in these two tools.

Despite providing inspiring performance via machine learning-based methods, there are many issues that require more scientific breakthroughs. The motivations of the proposed method derive from the following three aspects:

First and foremost, both patch-wise classification, which assigns a label to a given region, and pixel-wise classification, which provides a label for the central pixel of the ROI, neglect the shape and the size information. For instance, the micrinites are always small in size and the cutinites are very thin [18,19]. When the selected region contains two or more groups/components of macerals, the performance is directly affected. As a result, the previous methods based on a fixed-size window are often observed to generalize poorly.

Second, due to the complex and heterogeneous nature of coal, the task for identifying macerals requires more parameters describing the shape, texture and morphology. Comprehensive features along with powerful machine learning techniques are required to detect the subtle differences between maceral groups/components.

Last but not least, there is no publicly available software for identifying maceral groups/components, especially targeted for geologists without strong expertise in the machine learning and image processing domains. In addition, the existing methods focus on predicting the label for a given region or pixel, whereas they do not work in the identification of maceral components from the entire photomicrograph.

To address the above-mentioned concerns, we propose a novel framework for autonomous coal macerals identification based on image segmentation and classification (MISC). A two-level coarse-to-fine clustering strategy is implemented for image segmentation, and random forest is employed to classify maceral components from the entire microscopic image. The main contributions of the proposed framework can primarily be broken down into three aspects.


#### **2. Experiment Dataset**

The metallurgical coal samples used in the study are randomly selected from samples submitted to the laboratory of the United States Geological Survey (USGS). The selected samples were prepared through a sequence of operations, including sieving, molding and polishing. All the procedures follow protocols established by the ASTM Standard D2797 [20]. Photomicrographs are captured using a Leica DMRX microscope with a Leica DFC 480 digital camera under incident white light in oil immersion, in accordance with ASTM standard D2799-13 [7].

With increasing coalification, the difference in gray levels between macerals is reduced, and it becomes difficult to distinguish between vitrinite and liptinite at a high degree of coalification (e.g., R0 > 1.25%). In this work, the selected coal samples have a relatively low degree of coalification (i.e., R0 < 1%). The maceral composition of each coal sample was annotated by 5 petrographers according to ASTM Standard D2799-13, and the 39 of the 50 samples with consistent results were further analyzed. The resolutions of these photomicrographs differ from one another, in the range of (267–1024) × (230–768) px, with each pixel roughly corresponding to 2–4 µm. Table 1 shows all the maceral components analyzed in this study, with brief descriptions and the number of macerals (909 in total). In addition, 64 objects belonging to binder are also included in the dataset. Binder is large relative to maceral components and can hold these components together. An example of each maceral is provided in Figure 1.




**Figure 1.** Examples of maceral components and binder. Each row represents a class. They are binder, vitrinite, fusinite, semifusinite, cutinite, sporinite and inertodetrinite from top to bottom respectively. The black color areas (i.e., RGB value = 0) in each figure represent the background of the given maceral component. The micrinite is not shown here for its small size accounting for only a few pixels.

#### **3. Methods**

#### *3.1. Image Segmentation Based on Two-Level Clustering*

The main differences between maceral groups are the gray scale values. Generally, the gray scale values of liptinite, vitrinite and inertinite decrease successively, as demonstrated in Figure 2a. The gray scale distribution curves illustrate that there are noticeable differences among three maceral groups and the binder. However, it is difficult to define the boundaries between them. In addition, the boundaries corresponding to different photographs are different. The gray scale range of each maceral group varies with the degree of coalification. Figure 2b shows the distribution of gray scale values of binder and inertinite across four photomicrographs. It can be seen that the gray range of binder is relatively fixed, whereas that of inertinite group of 4 coal samples differs greatly. Therefore, it is unreliable to adopt a fixed threshold to segment microscopic photographs with different coalification degrees.

**Figure 2.** Comparison of gray scale distributions corresponding to different maceral groups and the binder. (**a**) Gray scale distributions of maceral groups and the binder in one coal sample; (**b**) the difference in the gray scale distributions across 4 coal samples. For simplicity, we only show the curves corresponding to binder and inertinite in (**b**).

In order to automatically detect the boundaries of each maceral, we employed image-wise segmentation which is able to divide each microscopic image into a sequence of discrete regions, each having similar attributes. It is an essential step for further maceral composition analysis. Considering the fact that the maceral components within each group mostly are not adjacent to each other and the gray scale values are the major difference between maceral groups, we adopt the gray scale values of each pixel as the features. K-means clustering, one of the most favorable clustering techniques, is utilized for its simplicity and computational efficiency [21,22].

The main steps of K-means algorithm can be summarized as follows:

(1) Choose initial cluster centroids µ<sub>1</sub>, µ<sub>2</sub>, µ<sub>3</sub>, . . . , µ<sub>*k*</sub> ∈ ℝ<sup>*d*</sup> randomly, where *k* represents the number of clusters and *d* represents the dimension of the feature space.

(2) Repeat until convergence:

For each pixel *p*, assign it to its nearest centroid,

$$c^{p} := \arg\min_{j} \left\| \mathbf{v}^{p} - \mu_{j} \right\|^{2} \tag{1}$$

Update each centroid,

$$\mu_{j} := \frac{\sum_{p=1}^{M} \mathbf{1}\{c^{p} = j\}\, \mathbf{v}^{p}}{\sum_{p=1}^{M} \mathbf{1}\{c^{p} = j\}} \tag{2}$$

where *c*<sup>*p*</sup> represents the cluster of pixel *p*, **v**<sup>*p*</sup> is a vector consisting of the RGB values of pixel *p*, and *M* represents the total number of pixels in each photomicrograph.

In addition, due to the complexity of coal properties, single-level clustering alone rarely provides satisfactory segmentation results. Hence, a coarse-to-fine strategy is adopted. In the coarse clustering level, the regular K-means clustering algorithm is first applied to get rough clustering results, splitting the whole image into two sub-clusters. In the fine clustering level, each of the previous clusters is partitioned again into two fine sub-clusters (i.e., maceral groups and binder).
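The K-means steps and the coarse-to-fine strategy can be sketched as follows. This is a simplified illustration, not the MISC implementation: it clusters scalar gray values instead of RGB vectors, the function names are hypothetical, and the initial centroids are placed at the minimum and maximum gray value rather than chosen randomly so that the example is reproducible.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """K-means on scalar gray values; returns (labels, centroids)."""
    lo, hi = min(values), max(values)
    if lo == hi:  # flat input: a single cluster
        return [0] * len(values), [float(lo)]
    # Assumption: evenly spaced initial centroids instead of random ones.
    centroids = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every value.
        new_labels = [
            min(range(k), key=lambda j: (v - centroids[j]) ** 2) for v in values
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, c in zip(values, labels) if c == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids


def two_level_segmentation(values):
    """Coarse split into two clusters, then each cluster split into two again."""
    coarse, _ = kmeans_1d(values, k=2)
    final = [0] * len(values)
    for c in (0, 1):
        idx = [i for i, lab in enumerate(coarse) if lab == c]
        if not idx:
            continue
        fine, _ = kmeans_1d([values[i] for i in idx], k=2)
        for i, f in zip(idx, fine):
            final[i] = 2 * c + f  # region label in 0..3
    return final


# Illustrative gray values forming four well-separated groups.
gray = [10, 12, 11, 60, 62, 61, 150, 152, 151, 240, 242, 241]
regions = two_level_segmentation(gray)
```

The four resulting labels are only cluster indices; mapping them to binder, vitrinite, liptinite and inertinite would require an additional rule that is not shown here.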

#### *3.2. Feature Extraction*

Inspired by the way that petrologists examine photomicrographs, we extracted three types of features for maceral identification. Table 2 lists the 172 features utilized in this study, such as reflectance contrasts, shape, morphology and size, which can be categorized into geometric, grayscale and texture features [23]. Detailed descriptions of these features can be found in the literature on image pattern recognition [24,25].

**Table 2.** Feature space utilized in this study.


For instance, we employ image moments as the shape descriptor. Moment invariants have been extensively exploited to characterize image patterns. Among various image moments, Hu's 7 invariant moments have been widely applied in a variety of applications because they are invariant to image translation, scaling and rotation [26]. They are defined as follows:

$$\begin{aligned}
M_1 &= \eta_{20} + \eta_{02} \\
M_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
M_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
M_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
M_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] \\
&\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2] \\
M_6 &= (\eta_{20} - \eta_{02})[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
M_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] \\
&\quad + (3\eta_{12} - \eta_{30})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2]
\end{aligned} \tag{3}$$

$$
\eta_{ab} = \frac{\mu_{ab}}{\mu_{00}^{\rho}}, \quad \rho = \frac{a+b}{2} + 1 \tag{4}
$$

where µ*ab* represents the central moment and η*ab* stands for the normalized central moments.
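To make the definitions concrete, the following sketch computes the seven moments for a binary region mask in plain Python. It uses the standard form of Hu's invariants (with the η<sub>11</sub> term squared in *M*<sub>2</sub>); the function name and the toy mask are assumptions made for illustration, and a real pipeline would more likely rely on an image-processing library.

```python
def hu_moments(mask):
    """Hu's seven invariant moments of a binary mask given as a 2-D list of 0/1."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    m00 = len(pts)  # zeroth moment = area of the region
    cx = sum(x for x, _ in pts) / m00
    cy = sum(y for _, y in pts) / m00

    def eta(a, b):
        # Central moment mu_ab normalized by mu_00 ** ((a + b) / 2 + 1).
        mu = sum((x - cx) ** a * (y - cy) ** b for x, y in pts)
        return mu / m00 ** ((a + b) / 2 + 1)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    p, q = n30 + n12, n21 + n03  # recurring sums
    m1 = n20 + n02
    m2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    m3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    m4 = p ** 2 + q ** 2
    m5 = ((n30 - 3 * n12) * p * (p ** 2 - 3 * q ** 2)
          + (3 * n21 - n03) * q * (3 * p ** 2 - q ** 2))
    m6 = (n20 - n02) * (p ** 2 - q ** 2) + 4 * n11 * p * q
    m7 = ((3 * n21 - n03) * p * (p ** 2 - 3 * q ** 2)
          - (n30 - 3 * n12) * q * (3 * p ** 2 - q ** 2))
    return [m1, m2, m3, m4, m5, m6, m7]


# Toy L-shaped region (illustrative only).
mask = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]
rotated = [list(r) for r in zip(*mask[::-1])]        # 90 degree rotation
shifted = [[0] * 6] + [[0, 0] + row for row in mask]  # translation
original = hu_moments(mask)
```

Comparing `hu_moments(rotated)` and `hu_moments(shifted)` with `original` is a quick way to confirm the translation and rotation invariance mentioned above.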

The features x17–x90 are statistical characteristics related to the gray scale values of the region of interest; x91–x99 are statistics for examining texture features based on the spatial relationship of pixels [27]; x99–x113 are the gray gradient features [28]; and the remaining features x114–x172 correspond to local binary patterns encoding the texture information [29].

#### *3.3. Random Forest for Image Classification*

Random forest (RF) is an ensemble machine learning method, which consists of multiple uncorrelated decision trees. It is widely used in image classification tasks due to its high accuracy, easy parameterization and robustness against overfitting [30]. Figure 3 illustrates how the random forest model works. Given the dataset with *N* samples *D* = n (*x* 1 , *y* 1 ), . . . ,(*x l* , *y l* ), . . . ,(*x <sup>N</sup>*, *y N*) o , where *x <sup>l</sup>* = [*x l* (1), *x l* (2), . . . , *x l* (172)] and *y <sup>l</sup>* <sup>∈</sup> {1, 2, 3, 4, 5, 6, 7, 8} denote the input 172 features and the output label of sample *l*, the general idea of random forest can be described as,


(1). Randomly select *N* samples with replacement from the original dataset, and obtain *N* subsamples for constructing each tree.

(2). Select features for constructing decision tree nodes from a random subset of all 172 features, and construct a decision tree.

(3). Repeat steps (1) and (2) *B* times and construct a random forest with *B* trees. The final prediction result is obtained by the majority vote of the trees in the forest.

As an ensemble model, a random forest can be fitted in a short time because each decision tree is built independently, which makes parallel computing and modeling possible [31,32]. We also test the performance of five other machine learning methods: Fine Tree, Radial Basis Function kernel Support Vector Machine (RBF SVM), Weighted K-Nearest Neighbors (KNN), Linear Discriminant and Subspace KNN [33,34].
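The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration and not the authors' implementation: each "tree" is reduced to a depth-1 stump, so the random feature subset is drawn once per tree rather than at every node, and all function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, feature_ids):
    """Depth-1 tree: best (feature, threshold) among the candidate features."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_ids:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # every candidate feature is constant in this sample
        lab = majority(y)
        return (feature_ids[0], 0.0, lab, lab)
    return best[1:]


def fit_forest(X, y, n_trees, n_sub_features, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):                           # step (3): repeat B times
        idx = [rng.randrange(n) for _ in range(n)]     # step (1): bootstrap sample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        feats = rng.sample(range(d), n_sub_features)   # step (2): random features
        forest.append(fit_stump(Xb, yb, feats))
    return forest


def predict(forest, row):
    """Final prediction: majority vote over all trees."""
    votes = [l_lab if row[f] <= t else r_lab for f, t, l_lab, r_lab in forest]
    return majority(votes)
```

In practice the stump would be replaced by a full decision tree that draws a new feature subset at every node, as described in step (2); the bootstrap sampling and the majority vote stay the same.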

**Figure 3.** Scheme of the random forest algorithm. The final prediction is obtained by taking a majority vote of the predictions from all the trees in the forest.

#### *3.4. Evaluation Criteria*

The results of the automated segmentation method are quantitatively evaluated using three popular criteria: clustering accuracy, entropy and purity. We refer to the class labels as the ground truth and to the results of the clustering algorithms as the clusters [35,36]. Clustering accuracy is the most intuitive measure of clustering performance and is defined as follows:

$$accuracy = \sum\_{i=1}^{4} \frac{n\_{ii}}{n} \tag{5}$$

where *n<sub>ii</sub>* represents the number of common samples in cluster *i* and class *i*, *i* ∈ {1, 2, . . . , 4}, and *n* is the size of the data set.

Entropy is an information theoretic measure and is defined as

$$entropy = -\sum\_{i=1}^{4} \frac{n\_i}{n} \sum\_{j=1}^{4} \frac{n\_{ij}}{n\_i} \log\_2 \frac{n\_{ij}}{n\_i} \tag{6}$$

where *n<sub>ij</sub>* indicates the number of common samples in cluster *i* and class *j*, and *n<sub>i</sub>* stands for the number of samples in cluster *i*.

Purity is computed to measure the degree of clusters containing a single class. The purity is calculated as follows

$$purity = \sum\_{i=1}^{4} \frac{n\_i}{n} \max\_{j}\left(\frac{n\_{ij}}{n\_i}\right) \tag{7}$$
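The three criteria can be computed directly from a cluster-by-class count matrix. The sketch below is illustrative only; the function name `clustering_scores` and the toy counts are assumptions, and the maximum in the purity term is taken over the classes *j* within each cluster.

```python
import math


def clustering_scores(n):
    """Accuracy, entropy and purity from counts n[i][j] (cluster i, class j)."""
    total = sum(sum(row) for row in n)
    # Accuracy: samples on the diagonal (cluster i matched to class i).
    accuracy = sum(n[i][i] for i in range(len(n))) / total
    # Purity: (n_i / n) * max_j(n_ij / n_i) simplifies to max_j(n_ij) / n.
    purity = sum(max(row) for row in n) / total
    # Entropy: weighted sum of the class entropies inside each cluster.
    entropy = 0.0
    for row in n:
        n_i = sum(row)
        for n_ij in row:
            if n_ij > 0:  # the term 0 * log(0) is taken as 0
                p = n_ij / n_i
                entropy -= (n_i / total) * p * math.log2(p)
    return accuracy, entropy, purity


# Toy example with two clusters and two classes (hypothetical counts).
counts = [[3, 1],
          [0, 4]]
print(clustering_scores(counts))
```

When the dominant class of every cluster lies on the diagonal of the matrix, as in this toy example, purity and accuracy coincide; otherwise purity is the larger of the two.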

#### **4. Experimental Results and Discussion**

#### *4.1. Image Segmentation*

The proposed segmentation strategy has been tested on 39 microscopic images taken with a Leica DFC 480 digital camera, and the results in terms of accuracy, purity and entropy are given in Table 3. It can be observed that the proposed two-level K-means algorithm achieved significantly higher accuracy (90.82%), higher purity (90.82%) and lower entropy (0.6042) than the other clustering algorithms. In particular, two-level coarse-to-fine clustering consistently achieves higher accuracy than the corresponding single-level clustering. For instance, regarding the K-means algorithm, the accuracy of the two-step strategy is 17.59% better than that of applying single-level K-means.


**Table 3.** Quantitative assessment of automatic segmentation methods.

| Methods | Accuracy (%) | Purity (%) | Entropy |
|---|---|---|---|
| Fuzzy c-means | 69.35 | 86.36 | 0.7291 |
| K-medoids | 68.34 | 89.94 | 0.6805 |
| K-means | 73.21 | 89.43 | 0.6748 |
| 2-level Fuzzy c-means | 76.58 | 83.33 | 0.7961 |
| 2-level K-medoids | 82.14 | 85.31 | 0.7306 |
| 2-level K-means | 90.82 | 90.82 | 0.6042 |

Furthermore, one of the tested images was selected to visualize the performance of the proposed strategy and the other five clustering methods. The segmentation results of single-level clustering and two-level clustering are compared with the ground truth segmentations provided by five petrologists for evaluation. It can be seen from Figure 4 that the boundaries in the segmentation images produced by K-means are slightly clearer than those produced by Fuzzy c-means (FCM) and K-medoids clustering. The segmented images produced by the two-level clustering are sharper and much closer to the ground truth. Overall, the proposed two-level coarse-to-fine clustering strategy based on K-means has outperformed the other clustering algorithms.

**Figure 4.** Comparison of automated segmentation results with ground truth. From left to right, the images represent original image, ground truth, the segmentation results of single-level Fuzzy c-means (FCM), 2-level FCM, single-level K-medoids, 2-level K-medoids, single-level K-means, and 2-level K-means.

We also compare the results of image segmentation with the results of the fixed-size window strategy. As can be seen from Figure 5, it is feasible to detect the thin sporinites (i.e., a-1) and the granular micrinites (i.e., a-2), and we can obtain the shape information for each object of interest. As to the fixed-size window strategy, 41 px × 41 px was demonstrated to be the optimal size for maceral identification [3]. However, it is unrealistic to retrieve the shape information in feature extraction with such a window. In addition, micrinites distributed through the window may only account for a small amount of the area, and therefore it might be unreliable to train the classification models based on the information provided by the whole window. Similarly, sporinites are always thin and they might be misclassified based on the fixed-size window strategy. Our method is also very effective in extracting maceral composition with a large area (i.e., a-3), which is helpful to improve classification accuracy. Although the proposed strategy achieved satisfying performance in identifying the objects of interest (i.e., maceral groups), the differences in terms of gray scale values between maceral components are too subtle to differentiate them. More discriminative features are required.

**Figure 5.** The results of image segmentation based on (**a**) the proposed Maceral Identification strategy based on image Segmentation and Classification (MISC) and (**b**) the fixed-size window strategy. (a-0) and (b-0) denote the original image. (a-1) and (b-1) are the enlargement of sporinites, (a-2) and (b-2) are the enlargement of micrinites, (a-3) and (b-3) are the enlargement of fusinites.

#### *4.2. Maceral Composition Classification*

Each microscopic image was divided into a series of discrete objects (i.e., maceral groups). We then extracted geometric, grayscale and texture features from each object, and created a 172-dimensional feature vector for each object. We compared the recognition performance of random forest with five popular classification methods via 10-fold cross-validation. The experiments were repeated 10 times and the average accuracies are reported in Table 4, which summarizes the classification results of all the 973 objects obtained by image segmentation. The proposed approach yields a high accuracy of 90.44%, outperforming the other classifiers. This should be regarded as a very good performance, especially when we take into account the high degree of complexity of coals. The obtained results show that the proposed strategy based on image segmentation and classification has a high potential for maceral components identification.

**Table 4.** Comparing the results of random forest and the other 5 machine learning methods for identifying maceral components (10-fold cross-validation, repeated 10 times). Values are classification accuracy (%) ± standard deviation (%).

| Classifier | Geometric Features | Grayscale Features | Texture Features | All Features |
|---|---|---|---|---|
| Fine Tree | 72.29 ± 0.72 | 70.01 ± 0.83 | 78.79 ± 1.02 | 85.71 ± 1.08 |
| Radial basis function kernel support-vector machine | 72.47 ± 0.11 | 76.97 ± 0.59 | 85.54 ± 0.33 | 86.62 ± 0.58 |
| Weighted K-Nearest Neighbors | 55.22 ± 0.49 | 71.40 ± 0.56 | 80.44 ± 0.47 | 78.69 ± 0.88 |
| Linear Discriminant | 53.76 ± 0.32 | 70.34 ± 0.46 | 82.12 ± 0.40 | 83.36 ± 0.60 |
| Subspace K-Nearest Neighbors | 55.45 ± 0.74 | 71.13 ± 0.68 | 79.61 ± 0.56 | 81.26 ± 0.47 |
| Random Forest | 79.30 ± 0.47 | 78.86 ± 0.34 | 85.64 ± 0.30 | 90.44 ± 0.37 |


We further test the performance of each individual kind of feature. The recognition performance using texture features is much higher than that of the other two types of features. Generally, the fusion of multiple kinds of features achieves a significant improvement over a single kind of feature, except for the weighted KNN classifier.

To observe relations between the predictions of classifiers and the true labels, we also employ confusion matrices to report the results of different approaches. The confusion matrix enables us to know not only the error rates being made by a classifier but also the types of errors. More specifically, the rows of the matrix represent the predicted class, and the columns correspond to the true class (i.e., ground truth). The green diagonal cells stand for the number of correctly classified observations, while the red cells represent the misclassified observations. The precision and the recall rate corresponding to each class are also shown at the far right of each row and the bottom of each column, respectively. The overall accuracy is shown at the bottom-right corner of the matrix. As shown in Figure 6, the proposed method provides satisfying performance for most of the maceral components. The main flaws of all these six classifiers come from the misclassification of semifusinite and inertodetrinite. The following reasons could contribute to the worse performance for these two components: semifusinite and inertodetrinite belong to the inertinite group, and the difference in terms of gray level is too subtle to classify them properly [37]; semifusinite is intermediate between fusinite and vitrinite, and has a similar texture and general structure to fusinite; the origin of inertodetrinite is similar to fusinite and semifusinite [38].
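The quantities read off the margins of such a confusion matrix can be reproduced as follows. This is a minimal sketch that follows the layout described above (rows are predicted classes, columns are true classes); the function name and the two-class counts are hypothetical.

```python
def confusion_summary(cm):
    """Per-class precision and recall plus overall accuracy.

    cm[i][j] counts observations predicted as class i whose true class is j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    precision, recall = [], []
    for c in range(k):
        predicted_c = sum(cm[c])                      # row sum: all predicted as c
        actual_c = sum(cm[r][c] for r in range(k))    # column sum: all truly c
        precision.append(cm[c][c] / predicted_c if predicted_c else 0.0)
        recall.append(cm[c][c] / actual_c if actual_c else 0.0)
    accuracy = sum(cm[c][c] for c in range(k)) / total
    return precision, recall, accuracy


# Hypothetical two-class example.
cm = [[8, 2],
      [1, 9]]
precision, recall, accuracy = confusion_summary(cm)
print(precision, recall, accuracy)
```

With this layout, precision is computed along each row and recall along each column, which matches the placement of the two rates at the far right and at the bottom of the matrix.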


**Figure 6.** Confusion matrix comparison of six classifiers, including the results of (**a**) Fine Tree; (**b**) Radial basis function kernel support vector machine; (**c**) Weighted K-Nearest Neighbors (KNN); (**d**) Linear Discriminant Analysis; (**e**) Subspace KNN and (**f**) Random Forest. The labels in each subfigure include 1: binder, 2: sporinite, 3: cutinite, 4: vitrinite, 5: fusinite, 6: semifusinite, 7: inertodetrinite, and 8: micrinite.

The process of training a random forest involves the construction of multiple decision trees. In this study, we also evaluated the classification accuracy as the number of trees increases from 1 to 300. As shown in Figure 7, more trees generally provide better performance, especially when the number of trees is smaller than 50. However, the improvement diminishes as the number of trees increases beyond 50. Considering the tradeoff between the computational efficiency and the robustness of the developed model, we set the number of trees to 200 in this study.

**Figure 7.** The identification accuracies with different number of trees in random forest.

#### *4.3. The Platform of Automatic Coal Petrographic Analysis*

The proposed maceral analysis method, MISC, based on image segmentation and classification, makes it possible to identify the maceral composition automatically and intelligently. To facilitate its use by petrologists, we developed standalone software implemented in Matlab. We integrated the two-level K-means and various classification algorithms into the software for intelligent identification of maceral composition. Figure 8 is a screenshot of the MISC software. Users can submit a microscopic image of coal with the degree of coalification R0 < 1.0%. The segmentation results are presented as four subfigures, corresponding to the binders and three maceral groups. The classification result for each object detected by image segmentation is shown with different colors for visualization. MISC is freely available at the following website: https://github.com/GuyooGu/MISC-Master. It can be used to support the decisions of petrologists in classifying maceral components. To the best of our knowledge, it is the first non-commercial software for identification of maceral components. It is an efficient and effective tool for the complete analysis of maceral composition.

**Figure 8.** The user interface of MISC for automatic coal petrographic analysis.


#### **5. Conclusions**

Inspired by the way that petrologists examine photomicrographs, we proposed an automatic and effective framework for maceral classification. The proposed strategy is fundamentally different from previous attempts, which classify a region or the central pixel of a ROI. The regions of interest are cropped automatically based on image segmentation rather than on a fixed-size window. Current fixed-size window-based strategies, including both patch-wise classification and pixel-wise classification, neglect the shape and the size information and therefore do not work in identifying maceral composition from one entire photomicrograph. The two-level coarse-to-fine clustering strategy achieved significantly better segmentation results than the corresponding single-level clustering. In addition, considering the complicated nature of coal, it may prove difficult to identify maceral components based on a single kind of feature. Our results suggest that a classification approach based on multiple kinds of features, such as geometric, grayscale and texture features, is a promising direction for identifying maceral components.

However, it should be stressed that the research described in this paper is a preliminary study which still has some limitations. First, the proposed method only works when the degree of coalification is smaller than 1.0%. With the increase of that degree, it will be difficult to distinguish vitrinite and liptinite based on gray scale values. Second, we assume that there are four maceral groups in each sample. However, this assumption may not hold in a few cases. We have tried density-based clustering methods, such as density-based spatial clustering of applications with noise (DBSCAN), which can automatically detect the optimal number of clusters. However, the performance is not better than that of the proposed method. We will investigate this issue in the future. In addition, the 39 photomicrographs used in this study were analyzed according to the definitions in ASTM Standard D 2799-13 and the atlas in [39]. There are seven types of maceral components, belonging to three maceral groups (i.e., vitrinite, liptinite and inertinite). Liptinites include sporinite and cutinite. Inertinite macerals include fusinite, semifusinite, inertodetrinite, and micrinite. However, the mineral components and the other macerals, such as funginite or macrinite, are rarely found in these photomicrographs, and are not abundant in nature. Therefore, in this study, we do not consider these components, and follow a simplified classification like that in [39]. The proposed MISC strategy obtained relatively good performance, especially when we take into account the high degree of complexity of coals. Despite these limitations, to the best of our knowledge, our work is the first study aiming to provide a complete analysis of maceral composition.

**Author Contributions:** Conceptualization, H.W. and L.Z.; funding acquisition, M.L. (Meng Lei) and L.Z.; methodology, H.W. and Y.C.; project administration, M.L. (Meng Lei); resources, M.L. (Ming Li); supervision, M.L. (Ming Li) and L.Z.; validation, Y.C. and M.L. (Ming Li); writing—original draft, H.W.; writing—review and editing, L.Z.

**Funding:** This research was funded by the Fundamental Research Funds for the Central Universities with grant number 2019ZDPY17.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Segmentation of River Scenes Based on Water Surface Reflection Mechanism**

**Jie Yu, Youxin Lin, Yanni Zhu, Wenxin Xu, Dibo Hou \*, Pingjie Huang and Guangxin Zhang**

State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China; yu\_jie@zju.edu.cn (J.Y.); yxlin@zju.edu.cn (Y.L.); yanni\_z@zju.edu.cn (Y.Z.); xuwen\_xin@zju.edu.cn (W.X.); huangpingjie@zju.edu.cn (P.H.); gxzhang@zju.edu.cn (G.Z.)

**\*** Correspondence: houdb@zju.edu.cn

Received: 15 February 2020; Accepted: 30 March 2020; Published: 3 April 2020

**Abstract:** Segmentation of a river scene is a representative case of complex image segmentation. Different from road segmentation, river scenes often have unstructured boundaries and contain complex light and shadow on the water's surface. According to the imaging mechanism of water pixels, this paper designed a water description feature based on a multi-block local binary pattern (MB-LBP) and Hue variance in HSI color space to detect the water region in the image. The improved Local Binary Pattern (LBP) feature was used to recognize the water region and the local texture descriptor in HSI color space using Hue variance was used to detect the shadow area of the river surface. Tested on two data sets including simple and complex river scenes, the proposed method has better segmentation performance and consumes less time than those of two other widely used methods.

**Keywords:** river scene segmentation; local binary pattern; hue variance; surface reflection; hand-designed image descriptors

#### **1. Introduction**

Segmentation of a river scene plays an important role in many fields such as the water hazard detection of unmanned ground vehicles [1], the navigation of unmanned ships [2], river analysis or flood monitoring by remote sensing [3–6] and vision-based object monitoring on rivers. This study aims to recognize the river region in an image taken in outdoor scenes based on the water surface reflection mechanism, which is an important task in applications of intelligent video surveillance in river environments. Moreover, segmentation of the river scene is a representative case of complex image segmentation, which can serve as a reference for complex image segmentation.

For water region segmentation, researchers have explored different kinds of methods that fall into three main categories: image processing-based methods, Machine Learning-based methods (including Deep Learning, Supervised Learning, Clustering, etc.), and hardware-based methods. For image processing-based methods, Rankin et al. [7] combined color and texture features to detect the water region according to the appearance characteristics of the river in the outdoor scene. Yao [8] first used the Region Growing method to separate the obvious water region based on the brightness value. A designed texture feature was then used to perform K-Means clustering on each 9 × 9 patch in the image, and the class with the smallest average texture value was classified as the water region; the detection of water regions with shadow, however, needs the aid of stereo vision. Zhao et al. [9] used the adaptive threshold Canny edge detection algorithm to detect the river boundary. The texture and structure of images are also widely used in related research in water scenes such as waterline detection [10] and maritime horizon line detection [11]. For Machine Learning-based methods, Achar et al. [12] proposed a self-supervised algorithm to classify all the image patches in an image into a water or not-water category using RGB, texture, and height features. The results show high accuracy, but this algorithm requires prior knowledge of the horizon from hardware and is only applicable to images that conform to a specific structure. Moreover, with the development of Deep Learning, it has also been applied to water region segmentation. For example, Zhan et al. [13] proposed an online learning approach to recognize the water region for the USV in an unknown navigation environment using a convolutional neural network (CNN). Han et al. [14] used a Fully Convolutional Network (FCN) to achieve water hazard detection on the road. Despite its high accuracy, an artificial neural network with a complex structure needs to be pre-trained on many scenes before use and requires high computing power. For hardware-based methods, some studies have used various optical sensors such as laser radar [15], infrared cameras [1], stereo cameras [16,17], and polarized cameras [16,18,19] to realize water hazard detection based on the optical characteristics of water [20]. These methods are still difficult to popularize in applications due to their cost and equipment complexity.

The above methods have some defects. Rankin [21] observed that rivers have an inhomogeneous appearance in outdoor scenes. Methods that simply use image features under the assumption that river appearance is fairly uniform therefore remain problematic and perform poorly in the presence of shadow and changing illumination. For the same reason, it is also inappropriate for Machine Learning-based methods to use the global features of an image to train a classification model that segments that same image. Hardware-based methods are beyond the scope of this paper.

Since image processing technology has the advantages of simplicity and interpretability, this study proposes a segmentation algorithm that uses designed image features without machine learning. To overcome the drawback that current methods cannot deal well with the inhomogeneous appearance of the river, this study designs an improved LBP feature extraction method based on the water surface reflection mechanism to detect the water region in the image. A texture feature based on Hue (*H*) variance in HSI color space is also introduced for detection of the shadow area. Compared with two other principal methods using image processing techniques, the proposed method consumes the least time, and in the complex river scenes where other methods failed, the proposed algorithm still shows satisfactory performance. Lastly, the parameters of the proposed algorithm are discussed for better performance.

#### **2. Materials and Methods**

### *2.1. Algorithm Framework*

The qualitative imaging law of the riverine water region in an image is the basis of the algorithm designed in this paper. The overall flowchart of the proposed algorithm is shown in Figure 1.

First of all, the input image is pre-processed by down-sampling and blurring. The pre-processing operations will be discussed in Section 3.1. Secondly, the improved LBP feature and the local hue variance are calculated in parallel. The water regions with and without shadow are then both obtained by thresholding, and the two parts are fused as the major water region. Finally, morphological operations are carried out on this candidate water region, and the largest connected domain of the result is taken as the final water region. Selecting the largest connected domain is an important step. It is based on the common sense that the water area often occupies a main and large part of the image, which helps to eliminate the pseudo-water patches whose features are similar to those of water patches.
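The last step of this framework, keeping only the largest connected domain of the candidate mask, can be sketched as follows. This is an illustrative pure-Python version under the assumption of a binary mask stored as a list of rows and 4-connectivity; the function name `largest_component` is hypothetical, and the connectivity actually used by the authors is not stated in this section.

```python
from collections import deque


def largest_component(mask):
    """Keep only the largest 4-connected region of a binary mask (list of rows)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                # Breadth-first search collects one connected region.
                comp, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    out = [[0] * w for _ in range(h)]
    for y, x in best:
        out[y][x] = 1
    return out


# Candidate mask with one large region and two isolated pseudo-water pixels.
candidate = [[1, 1, 0, 0, 1],
             [1, 1, 0, 0, 0],
             [0, 0, 0, 1, 0]]
water = largest_component(candidate)
print(water)
```

The two isolated pixels are discarded and only the 2 × 2 block survives, which mirrors how small pseudo-water patches are removed from the fused candidate region.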

**Figure 1.** Flowchart of the proposed algorithm.
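The final step of the framework, keeping only the largest connected domain of the candidate mask, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `largest_connected_region` is hypothetical, the mask is a plain list of lists, and 4-connectivity is an assumption, since the text does not state which connectivity is used.

```python
from collections import deque


def largest_connected_region(mask):
    """Return a mask of the same shape keeping only the largest 4-connected region of 1s."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []  # pixel coordinates of the largest region found so far

    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            region, queue = [], deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                # 4-connectivity is an assumption; the text does not specify it.
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(region) > len(best):
                best = region

    out = [[0] * cols for _ in range(rows)]
    for r, c in best:
        out[r][c] = 1
    return out


# Toy candidate mask: a 3-pixel region and a 2-pixel pseudo-water region.
candidate = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
water = largest_connected_region(candidate)
```

In the toy mask the smaller region on the right plays the role of a pseudo-water patch and is discarded.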

#### *2.2. Light Reflection Mechanism of Water Surface*

In order to study the imaging law of water in rivers, it is necessary to first understand the general reflection mechanism of objects. According to Lambert's law, the intensity that reaches the image sensor from an object surface through various types of reflection is [22]:

$$I(\mathbf{x}) = \int\_{\omega} \mathbf{e}(\lambda) \rho\_k s(\mathbf{x}, \lambda) d\lambda,\tag{1}$$

where *e*(*λ*) is the color of the light source, *s*(*x*, *λ*) is the surface reflection value, *ρ<sup>k</sup>* is the sensitive function of the camera (*k* ∈ {R, G, B}), *ω* represents the visible spectral range, and *x* denotes the corresponding space coordinates. For a particular color camera, the pixel intensity values in the image are only related to the reflected light [23]. On this basis, the relationship between the pixel value of the water and the light reflection of the water surface is further expressed as follows:

$$I = LR\_{\text{total}} \tag{2}$$

where *L* is the illumination factor determined by the illumination conditions and *R*<sub>total</sub> is the total reflected energy. In river scenes, *R*<sub>total</sub> is mainly composed of the following four parts [21]: the energy reflected off the water surface, *R*<sub>r</sub>; the energy scattered by water molecules to the camera, *R*<sub>o</sub>; the energy reflected or scattered to the camera by materials suspended in the water, *R*<sub>s</sub>; and the energy reflected off the bottom of the water to the camera, *R*<sub>p</sub>:

$$R\_{\text{total}} = R\_{\text{r}} + R\_{\text{o}} + R\_{\text{s}} + R\_{\text{p}}.\tag{3}$$

Since the reflection from the water surface to the camera, *R*<sub>r</sub>, plays a dominant role in *R*<sub>total</sub>, (2) and (3) can be further simplified as:

$$I \approx LR\_{\text{r}}.\tag{4}$$

For light polarized perpendicular to and parallel to the plane of incidence, *R*<sub>r</sub> can be decomposed into *R*<sub>r,⊥</sub>(*θ*) and *R*<sub>r,∥</sub>(*θ*), respectively, where *θ* is the incident angle, *θ* ∈ [0, *π*/2], as shown in (5):

$$R\_{\text{r}}(\theta) = \frac{R\_{\text{r},\perp}(\theta) + R\_{\text{r},\parallel}(\theta)}{2}. \tag{5}$$

According to Fresnel Law:

$$R\_{\text{r},\perp}(\theta) = \left[\frac{n\_1 \cos \theta - n\_2 \sqrt{1 - \left(\frac{n\_1}{n\_2} \sin \theta\right)^2}}{n\_1 \cos \theta + n\_2 \sqrt{1 - \left(\frac{n\_1}{n\_2} \sin \theta\right)^2}}\right]^2, \tag{6}$$

$$R\_{\text{r},\parallel}(\theta) = \left[\frac{n\_1\sqrt{1 - \left(\frac{n\_1}{n\_2}\sin\theta\right)^2} - n\_2\cos\theta}{n\_1\sqrt{1 - \left(\frac{n\_1}{n\_2}\sin\theta\right)^2} + n\_2\cos\theta}\right]^2, \tag{7}$$

where *n*<sub>1</sub> is the refractive index of air, *n*<sub>2</sub> is the refractive index of water, and *θ* is the angle of incidence. *n*<sub>1</sub> = 1.0 and *n*<sub>2</sub> = 1.33 are taken under ideal conditions. Light from the water region reaches the sensor through various types of reflection, as shown in Figure 2, where *l* is the horizontal displacement of the point from the camera lens and *h* is the height at which the camera is placed (in a given scene, the image sensor used to capture images is commonly fixed). In the simplified scenario of Figure 2, the geometric relationship gives *α* ≈ *θ*, so *R*<sub>r</sub>(*θ*) can be converted into a function *R*<sub>r</sub>(*l*) of the horizontal displacement by setting:

$$\theta \approx \arctan{\frac{l}{h}}\tag{8}$$

and then substituting (8) into (5). Since the subsequent water detection algorithm uses only the qualitative rather than the quantitative law, the above equations do not need to hold exactly. Because the resulting expression is complicated, we explored the relationship between the reflection intensity of the water and the horizontal distance to the camera by evaluating it for several values of *h* that represent conventional installation heights, as shown in Figure 3.

**Figure 2.** Light path diagram of river scene.

**Figure 3.** Relationship between water surface reflection and horizontal distance *l* for different camera heights.

It can be seen that the reflected energy reaching the image sensor decreases monotonically from far to near. This qualitative law for water pixels is used to design the subsequent water region detection algorithm.
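The qualitative law can be checked numerically by evaluating Equations (5)–(8) directly. The sketch below is illustrative only: the function names are hypothetical, and the camera height of 3 m and the sample distances are assumed values, not the ones used for Figure 3.

```python
import math

N_AIR, N_WATER = 1.0, 1.33  # refractive indices taken from the text


def fresnel_reflectance(theta, n1=N_AIR, n2=N_WATER):
    """Unpolarized reflectance R_r(theta) from Equations (5)-(7); theta in radians."""
    cos_i = math.cos(theta)
    cos_t = math.sqrt(1.0 - (n1 / n2 * math.sin(theta)) ** 2)
    r_perp = ((n1 * cos_i - n2 * cos_t) / (n1 * cos_i + n2 * cos_t)) ** 2
    r_par = ((n1 * cos_t - n2 * cos_i) / (n1 * cos_t + n2 * cos_i)) ** 2
    return 0.5 * (r_perp + r_par)


def reflectance_at_distance(l, h):
    """R_r as a function of horizontal distance l for camera height h, using Equation (8)."""
    return fresnel_reflectance(math.atan(l / h))


# Assumed example values: camera mounted 3 m above the water, points 1 m to 50 m away.
distances = (1.0, 5.0, 20.0, 50.0)
samples = [reflectance_at_distance(l, 3.0) for l in distances]
```

Under these assumptions the values in `samples` grow with the distance `l`, which matches the statement that far water pixels receive more reflected energy than near ones.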

In addition to the above-mentioned reflection mechanism, the water surface in outdoor scenes often contains shadow caused by the occlusion of the riverside scenery. The *H* component in the HSI color space of the image is not sensitive to illumination, and can maintain a relatively stable state under illumination changes [24]. In order to calculate the feature of *H*, firstly the RGB image should be converted into an HSI one by:

$$h = \begin{cases} \beta, & g \ge b \\ 2\pi - \beta, & g < b \end{cases}, \quad h \in [0, 2\pi] \tag{9}$$

$$s = 1 - 3 \cdot \min(r, g, b), s \in [0, 1] \tag{10}$$

$$i = \frac{r+g+b}{3}, i \in [0,1],\tag{11}$$

where *r*, *g*, *b*, *h*, *s*, *i* are all normalized values, and:

$$\beta = \arccos\left\{ \frac{[(r-g)+(r-b)]/2}{\left[ (r-g)^2 + (r-b)(g-b) \right]^{0.5}} \right\}.\tag{12}$$

These laws are illustrated in Figure 4. The *I* and *H* values of the pixels along the specified column (indicated by the red line) are plotted on the right of Figure 4. For the water region without shadow, the distribution of the *I* values closely follows the variation law shown in Figure 3. For the water region with shadow, the *H* values remain spatially stable.

**Figure 4.** Pixel-wise intensity values and hue values in a column of river scene image.
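The conversion in Equations (9)–(12) can be sketched for a single pixel as follows. Two points are assumptions rather than statements from the text: the saturation is computed as 1 − 3·min(*r*, *g*, *b*)/(*r* + *g* + *b*), which reduces to Equation (10) when the components are normalized to sum to one, and a grey pixel (zero denominator in Equation (12)) is assigned a hue of 0 by convention. The function name `rgb_to_hsi` is hypothetical.

```python
import math


def rgb_to_hsi(r, g, b):
    """Convert r, g, b in [0, 1] to (h, s, i), with h in [0, 2*pi], after Equations (9)-(12)."""
    total = r + g + b
    i = total / 3.0
    # Assumption: Equation (10) is applied to chromaticities (components divided by their sum).
    s = 0.0 if total == 0 else 1.0 - 3.0 * min(r, g, b) / total

    num = 0.5 * ((r - g) + (r - b))
    # The radicand is non-negative in exact arithmetic; max() guards against rounding.
    den = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if den == 0:
        return 0.0, s, i  # grey pixel: hue is undefined, 0 is used by convention (assumption)

    beta = math.acos(max(-1.0, min(1.0, num / den)))  # clamp against rounding error
    h = beta if g >= b else 2.0 * math.pi - beta
    return h, s, i


red = rgb_to_hsi(1.0, 0.0, 0.0)
green = rgb_to_hsi(0.0, 1.0, 0.0)
blue = rgb_to_hsi(0.0, 0.0, 1.0)
```

The three primaries land at hue angles 0, 2π/3 and 4π/3, which is a quick sanity check on the sign convention in Equation (9).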

#### *2.3. Improved Local Binary Pattern Feature*

The water part in an image tends to present simpler textures. Some studies utilize this characteristic to segment water bodies. The common texture features include gray-level co-occurrence matrix (GLCM) [25], Laws' Mask [26], Local Binary Pattern (LBP) [27] and so on. The study of the water surface reflection mechanism in the previous section shows that the appearance of the water region changes spatially. Consequently, the results of the textural descriptors calculated from the whole image, such as GLCM, for water region in an image would have a distinct numerical difference. LBP constructs a local feature descriptor that reflects the magnitude relationship between the center pixel and the neighborhood ones, which can effectively deal with the inhomogeneous appearance in an image, and establish a more reliable description for an image patch. Based on the water surface reflection mechanism discussed previously, an improved LBP feature is designed to describe the spatial characteristics of water appearance and then used to detect the water part of an image.

To obtain the improved LBP feature, the image is first divided into patches of a specified size, and each patch is then further divided into 9 blocks. The pixel value (or the average value) in each block is denoted as *I<sub>k</sub>*, *k* = 1, 2, 3, ..., 9, as shown in Figure 5.

**Figure 5.** Illustration of generation of image patches.

The traditional LBP feature compares the value of the center block pixel *I*<sub>5</sub> with those of its neighbors and encodes the results into a binary string. In that encoding, however, the comparison results carry different weights for different directions. The proposed algorithm therefore modifies the traditional LBP feature, as shown in Algorithm 1. It is designed around the qualitative law that pixel values decrease from far to near, while pixel values at the same distance from the camera are close.

**Algorithm 1** Improved LBP feature

**Input:** gray-scale image patch in matrix form

**Output:** 8-dimension feature

- divide the image into 9 equal-size blocks with pixel value *I<sub>k</sub>*, *k* = 1, 2, 3, ..., 9
- **if** |*I*<sub>1</sub> − *I*<sub>2</sub>| < 1% ∗ *I*<sub>1</sub> **and** |*I*<sub>2</sub> − *I*<sub>3</sub>| < 1% ∗ *I*<sub>2</sub> **then** *f*<sub>1</sub> = 1 **else** *f*<sub>1</sub> = 0
- **if** |*I*<sub>4</sub> − *I*<sub>5</sub>| < 1% ∗ *I*<sub>4</sub> **and** |*I*<sub>5</sub> − *I*<sub>6</sub>| < 1% ∗ *I*<sub>5</sub> **then** *f*<sub>2</sub> = 1 **else** *f*<sub>2</sub> = 0
- **if** |*I*<sub>7</sub> − *I*<sub>8</sub>| < 1% ∗ *I*<sub>7</sub> **and** |*I*<sub>8</sub> − *I*<sub>9</sub>| < 1% ∗ *I*<sub>8</sub> **then** *f*<sub>3</sub> = 1 **else** *f*<sub>3</sub> = 0
- **if** |*I*<sub>4</sub> − *I*<sub>1</sub>| ≈ |*I*<sub>5</sub> − *I*<sub>2</sub>| ≈ |*I*<sub>6</sub> − *I*<sub>3</sub>| **then** *f*<sub>4</sub> = 1 **else** *f*<sub>4</sub> = 0
- **if** |*I*<sub>7</sub> − *I*<sub>4</sub>| ≈ |*I*<sub>8</sub> − *I*<sub>5</sub>| ≈ |*I*<sub>9</sub> − *I*<sub>6</sub>| **then** *f*<sub>5</sub> = 1 **else** *f*<sub>5</sub> = 0
- **if** *I*<sub>1</sub> > *I*<sub>4</sub> > *I*<sub>7</sub> **then** *f*<sub>6</sub> = 1 **else** *f*<sub>6</sub> = 0
- **if** *I*<sub>2</sub> > *I*<sub>5</sub> > *I*<sub>8</sub> **then** *f*<sub>7</sub> = 1 **else** *f*<sub>7</sub> = 0
- **if** *I*<sub>3</sub> > *I*<sub>6</sub> > *I*<sub>9</sub> **then** *f*<sub>8</sub> = 1 **else** *f*<sub>8</sub> = 0
- **return** *f* = [*f*<sub>1</sub>, *f*<sub>2</sub>, *f*<sub>3</sub>, *f*<sub>4</sub>, *f*<sub>5</sub>, *f*<sub>6</sub>, *f*<sub>7</sub>, *f*<sub>8</sub>]

In the improved LBP calculation, the features *f*<sub>1</sub>, *f*<sub>2</sub>, and *f*<sub>3</sub> indicate that the *I* values within every row of the image patch are very close, because water pixels at a similar distance reflect almost the same energy to the camera. The features *f*<sub>4</sub> and *f*<sub>5</sub> indicate that the pixel value differences in the vertical direction of the patch are numerically similar, since the distance between adjacent pixels is small enough for the gap to be neglected. Moreover, a farther pixel theoretically has a larger value than a closer one, which is the meaning of *f*<sub>6</sub>, *f*<sub>7</sub>, and *f*<sub>8</sub>. Finally, to overcome the drawback that different directions carry different weights in the traditional LBP, the improved LBP sums the Boolean results *f<sub>i</sub>*, *i* = 1, 2, 3, ..., 8 into a score *F*:

$$F = \sum\_{i=1}^{8} f\_i. \tag{13}$$

An appropriate threshold *T*<sub>1</sub> is then compared with the score *F* to decide whether the patch is part of the water region, which can be formulated as follows:

$$\begin{cases} \text{ water, if } F \ge T\_1\\ \text{ not water, if } F < T\_1 \end{cases} \tag{14}$$

Empirically, the algorithm has satisfactory performance in most scenes when *T*<sup>1</sup> is set to 5 or 6.
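Algorithm 1 together with Equations (13) and (14) can be sketched as below. This is one possible reading, not the authors' code: the function names are hypothetical, block values are taken as block means, and the "≈" test for *f*<sub>4</sub> and *f*<sub>5</sub> is implemented as "the three differences lie within `diff_tol` of each other", with `diff_tol = 1.0` an assumed tolerance that the text does not specify.

```python
def block_means(patch):
    """Mean of each of the 9 blocks of a square patch, in row-major order (I_1 ... I_9)."""
    n = len(patch)
    if n % 3 != 0 or any(len(row) != n for row in patch):
        raise ValueError("patch must be square with a side divisible by 3")
    step = n // 3
    means = []
    for br in range(3):
        for bc in range(3):
            vals = [patch[r][c]
                    for r in range(br * step, (br + 1) * step)
                    for c in range(bc * step, (bc + 1) * step)]
            means.append(sum(vals) / len(vals))
    return means


def improved_lbp(patch, row_tol=0.01, diff_tol=1.0):
    """Eight Boolean features f_1 ... f_8 of Algorithm 1 for one gray-scale patch."""
    i = [None] + block_means(patch)  # pad so that i[1] ... i[9] match the text

    def close(a, b):
        # Row test of Algorithm 1: |a - b| < 1% * a.
        return abs(a - b) < row_tol * a

    def similar(x, y, z):
        # Assumed reading of the approximate-equality test (tolerance is hypothetical).
        return max(x, y, z) - min(x, y, z) <= diff_tol

    return [
        int(close(i[1], i[2]) and close(i[2], i[3])),
        int(close(i[4], i[5]) and close(i[5], i[6])),
        int(close(i[7], i[8]) and close(i[8], i[9])),
        int(similar(abs(i[4] - i[1]), abs(i[5] - i[2]), abs(i[6] - i[3]))),
        int(similar(abs(i[7] - i[4]), abs(i[8] - i[5]), abs(i[9] - i[6]))),
        int(i[1] > i[4] > i[7]),
        int(i[2] > i[5] > i[8]),
        int(i[3] > i[6] > i[9]),
    ]


def is_water(patch, t1=5):
    """Equations (13) and (14): sum the features and compare the score F with T1."""
    return sum(improved_lbp(patch)) >= t1


# Synthetic 6 x 6 patch: uniform rows that get darker from top (far) to bottom (near).
far_to_near = [[v] * 6 for v in (200, 200, 190, 190, 180, 180)]
# Synthetic 3 x 3 patch with no spatial regularity.
clutter = [[10, 200, 30], [150, 20, 250], [90, 240, 5]]
```

With these synthetic inputs, `far_to_near` satisfies all eight tests, while `clutter` satisfies none and is rejected for any positive threshold.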

#### *2.4. Local Hue Variance in HSI Color Space*

Since the shadow area may not obey the model of (3), a second method is needed, after the main part of the water region has been recognized, to recognize the water area covered by shadow and thereby increase the recall of the water segmentation. In shadow, the lighting conditions are difficult to estimate, and the reflection law described by (8) is not available. However, the *H* values remain uniform among neighboring pixels, as shown in Figure 4.

The local hue variance is calculated as follows. First, the original RGB input image block is converted into an HSI image. The extracted *H* layer is then divided into 9 blocks of the same size, as shown in Figure 6. Finally, the mean value of *H* in each block, denoted as *H<sub>k</sub>* (*k* = 1, 2, 3, ..., 9), is calculated, and the variance of the *H<sub>k</sub>* about their mean is obtained:

$$V\_H = \frac{\sum\_{k=1}^{9} \left(H\_k - \overline{H}\right)^2}{9}.\tag{15}$$

**Figure 6.** Color space conversion and image division for local hue variance calculation.

An appropriate threshold *T*<sub>2</sub> is then compared with the obtained *V<sub>H</sub>* to identify the shadow area. Image patches whose *V<sub>H</sub>* is smaller than the threshold are labeled as part of the water region, which is expressed as:

$$\begin{cases} \text{ water, if } V\_H < T\_2\\ \text{ not water, if } V\_H \ge T\_2 \end{cases} \tag{16}$$

Since *H<sup>k</sup>* are normalized values in the calculation, the same *T*<sup>2</sup> can be used for different images. Empirically, *T*<sup>2</sup> can be set within [1.5, 1.8] to get satisfactory performance in most scenes.
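Equations (15) and (16) amount to a population variance followed by a threshold test. The sketch below assumes the nine block-mean hue values have already been computed; the function names are hypothetical, and the default `t2 = 1.5` is simply the lower end of the range quoted above.

```python
def local_hue_variance(block_hues):
    """Population variance of the nine block-mean hue values, Equation (15)."""
    if len(block_hues) != 9:
        raise ValueError("expected 9 block means")
    mean_h = sum(block_hues) / 9.0
    return sum((h - mean_h) ** 2 for h in block_hues) / 9.0


def is_shadowed_water(block_hues, t2=1.5):
    """Equation (16): a patch is labeled water when its hue variance is below T2."""
    return local_hue_variance(block_hues) < t2


uniform_patch = [1.2] * 9        # hue is stable across the patch
mixed_patch = [0.0] * 8 + [6.0]  # one block with a very different hue
```

Here `uniform_patch` has a variance of essentially zero and is accepted, whereas `mixed_patch` has a variance of 32/9 and is rejected for any threshold in the quoted range.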

#### *2.5. Morphological Operation*

Morphological Operation [28,29] is a widely used technique for digital images. The basic idea in binary morphology is to probe an image with a simple, pre-defined shape called structuring element, drawing conclusions on how this shape fits or misses the shapes in the image. The basic operations include erosion and dilation. The erosion eliminates sporadic targets or noise, while the dilation amplifies the target area. Different size structuring elements lead to different results of Morphological Operation.

In this study, the morphological operation is employed to eliminate potential pseudo-water patches that are wrongly detected by the proposed algorithm and to obtain the largest connected domain in the image as the water region. Erosion is performed first, and then three dilations with structuring elements of increasing size are carried out to ensure the integrity of the segmented area. This process is shown in Figure 7.

**Figure 7.** Morphological operation method designed in our algorithm.

The size of the structuring element can affect algorithm performance. Empirically, a rectangular structuring element slightly larger than the patch size is recommended for the erosion operation, since the patches used in pre-processing are rectangular. To bring the segmentation boundary closer to what the human visual system perceives, an elliptical structuring element was then used for dilation. Moreover, to eliminate foreground outliers during the morphological process, three dilations were carried out consecutively with structuring elements of increasing size, formulated as follows:

$$\begin{cases} L\_0 = 10 \times 10 \\ L\_1 = 15 \times 15 \\ L\_2 = \frac{L\_3 + L\_1}{2} \\ L\_3 = kL\_{\text{image}} \end{cases} \tag{17}$$

where *L*<sub>image</sub> is the size of the input image and *k* is a coefficient that ties the size of the structuring element to the size of the input image; its choice is discussed further in Section 3.5.
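The sizes in Equation (17) can be computed as in the following sketch. Several details are assumptions: *L*<sub>image</sub> is treated as a single side length in pixels, the elements are taken to be square with integer side lengths, *L*<sub>0</sub> is read as the erosion element and *L*<sub>1</sub> to *L*<sub>3</sub> as the three dilation elements, and *L*<sub>2</sub> is rounded down. The function name is hypothetical.

```python
def structuring_element_sizes(l_image, k=1.0 / 15.0):
    """Side lengths (L0, L1, L2, L3) in pixels following Equation (17)."""
    l0 = 10                    # assumed to be the erosion element
    l1 = 15                    # assumed to be the first dilation element
    l3 = round(k * l_image)    # largest element, tied to the image size by k
    l2 = (l3 + l1) // 2        # midpoint of L1 and L3, rounded down (assumption)
    return l0, l1, l2, l3


# Example: an image whose relevant side is 600 pixels, with k = 1/15.
sizes = structuring_element_sizes(600)
```

For a 600-pixel side and *k* = 1/15, this yields side lengths of 10, 15, 27 and 40 pixels.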

#### **3. Results and Discussion**

The proposed algorithm was tested on a dataset of 500 images taken from different river scenes. The scenes were divided into simple scenes (110) and complex scenes (390) to test the performance of the methods under both general and special conditions. Simple scenes in this study are common outdoor river scenes that do not contain complicating factors such as shadow and intense sunlight reflection; complex scenes are those that do. Moreover, since the image sensors used in different scenes are likely to differ, the images cover a range of resolutions.

In the experiments, two principal river segmentation methods, based on image features [8] and edge detection [9], respectively, were compared with our method. Note that general-purpose segmentation algorithms based on deep learning place high demands on datasets and computing capability, whereas the proposed algorithm is designed specifically for river scenes; they are therefore outside the scope of the comparison. The experiments were run in Python 3 on macOS with a 2.9 GHz Intel Core i5 CPU and 16 GB of memory. The algorithm's parameters were set empirically to fixed values in advance; they are discussed further later in this section.

#### *3.1. Pre-Processing*

Input images that are too large need to be scaled down to reduce the time consumed by the subsequent algorithm; this is followed by denoising and blurring. A threshold on the input image size (denoted as *S*<sub>o</sub>) was therefore set in advance, and down-sampling was repeated as needed, as shown in Figure 8.

**Figure 8.** Pre-processing operations.
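The repeated down-sampling can be sketched as a loop on the image dimensions. The text does not give the down-sampling factor or say which dimension is compared with *S*<sub>o</sub>; the sketch assumes that both sides are halved until the longer side no longer exceeds the threshold. The function name `downsample_shape` is hypothetical, and only the resulting shape is computed, not the resampled pixels.

```python
def downsample_shape(height, width, s_o):
    """Halve both dimensions repeatedly until the longer side is at most s_o."""
    if s_o < 1:
        raise ValueError("size threshold must be at least 1 pixel")
    while max(height, width) > s_o:
        # Factor 2 per pass is an assumption; never shrink below one pixel.
        height, width = max(1, height // 2), max(1, width // 2)
    return height, width


large = downsample_shape(3000, 4000, s_o=1000)  # two halvings are needed
small = downsample_shape(480, 640, s_o=1000)    # already below the threshold
```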

Since spikes or glitches in the pixel value signal, usually caused by noise, had a great effect on the local *H* variance, blurring was important for obtaining reliable *H* values. A Gaussian blur filter was therefore applied to reduce the influence of image noise before the image was analyzed. The results are compared in Figure 9. The picture on the right shows the distribution of the *H* and *I* values of the pixels along the column (the red line) in the image after Gaussian blurring. The *H* values after blurring were more suitable for the subsequent feature analysis.

**Figure 9.** Pixel values before and after blurring operation.

#### *3.2. Experiments in Simple Scenes*

The performance of the different methods was evaluated by Pixel Accuracy (*PA*) and Mean Intersection over Union (*MIoU*), two widely used criteria in image segmentation [30], defined as follows:

$$PA = \frac{\sum\_{i=0}^{k} p\_{ii}}{\sum\_{i=0}^{k} \sum\_{j=0}^{k} p\_{ij}} \, \tag{18}$$

$$MIoU = \frac{1}{k+1} \sum\_{i=0}^{k} \frac{p\_{ii}}{\sum\_{j=0}^{k} p\_{ij} + \sum\_{j=0}^{k} p\_{ji} - p\_{ii}}, \tag{19}$$

where *p<sub>ij</sub>* indicates the number of pixels of class *i* that are predicted to belong to class *j*, and there are *k* + 1 classes in total. In this study, *k* + 1 = 2. To better evaluate the overall segmentation performance, *PA* and *MIoU*, which may carry different weights in practical applications, were merged into the weighted harmonic mean *F<sub>β</sub>*:

$$F\_{\beta} = \frac{\left(1 + \beta^2\right)PA \times MIoU}{\beta^2 \times PA + MIoU},\tag{20}$$

where *β* > 0 measures the relative importance between *PA* and *MIoU*. When *β* > 1, *MIoU* has a greater impact. In practice, *MIoU* was slightly more important, thus *β* = 1.5 is adopted.
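Equations (18)–(20) can be evaluated from a confusion matrix as in the sketch below. The function names and the example matrix are hypothetical and are not results from this study; entry `p[i][j]` counts pixels of class `i` predicted as class `j`, as defined above.

```python
def pixel_accuracy(p):
    """Equation (18): correctly classified pixels divided by all pixels."""
    total = sum(sum(row) for row in p)
    return sum(p[i][i] for i in range(len(p))) / total


def mean_iou(p):
    """Equation (19): per-class intersection over union, averaged over the classes."""
    n = len(p)
    ious = []
    for i in range(n):
        row_sum = sum(p[i])                       # pixels whose true class is i
        col_sum = sum(p[j][i] for j in range(n))  # pixels predicted as class i
        ious.append(p[i][i] / (row_sum + col_sum - p[i][i]))
    return sum(ious) / n


def f_beta(pa, miou, beta=1.5):
    """Equation (20): weighted harmonic mean of PA and MIoU."""
    return (1 + beta ** 2) * pa * miou / (beta ** 2 * pa + miou)


# Made-up two-class confusion matrix (rows: true class, columns: predicted class).
confusion = [[50, 10],
             [5, 35]]
pa = pixel_accuracy(confusion)
miou = mean_iou(confusion)
score = f_beta(pa, miou)
```

Because *F<sub>β</sub>* is a weighted harmonic mean, `score` always lies between `pa` and `miou`; with *β* = 1.5 it sits closer to `miou`. The sketch does not guard against a class that is absent from both the ground truth and the prediction, which would make the corresponding denominator zero.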

The results of some examples are shown in Figure 10, where the water region detected by the "intensity + texture" method is marked in blue, while the edges of the regions detected by the "edge detection" method and the proposed algorithm are highlighted in red. Table 1 shows the criteria values of the results.

**Figure 10.** Segmentation results of different methods in simple scenes (**a**–**e**). From left to right: the first column shows the input images after pre-processing; the second and third columns show the segmentation results using "intensity + texture" features and the adaptive threshold edge detection algorithm, respectively; the fourth column shows the results of our method.

**Table 1.** Performance of different methods in simple scenes.


All three algorithms achieved acceptable segmentation results, which means that they were all effective for simple river scenes. However, the proposed algorithm performed more stably. More importantly, the proposed algorithm took the least time, as shown in Figure 11.

**Figure 11.** Speed of different methods in simple scenes.

The method using "intensity + texture" features not only had to calculate the brightness and texture information of each small image patch, but also relied on the result of a clustering algorithm to reach its decision. The edge detection-based method often produced many edges initially, and its adaptive threshold step required many calculations to pick the edge most likely to be the riverbank. Both methods therefore required a large amount of computation. In contrast, the algorithm designed in this paper was essentially a fast two-class classification of each image patch against a preset threshold, and the improved LBP feature is based on comparing the intensities of neighboring pixels rather than on exact calculations. Therefore, the proposed algorithm consumed the least time.

#### *3.3. Experiments in Complex Scenes*

Besides the simple river scenes, there are also complex outdoor scenes in which traditional algorithms struggle or even fail. The different methods were tested on complex scenes; four typical examples are shown in Figure 12, with the corresponding criteria values in Table 2.

Moreover, Figure 13 shows the speed of different methods in complex river scenes.


**Table 2.** Performance of different methods in complex scenes.

(**a**) river scene with clear reflection

(**b**) river scene with a covered object (a boat) on river surface

(**c**) river scene with intense sunlight reflection

(**d**) river scene with large area shadow

**Figure 12.** Segmentation results of different algorithms in complex scenes (**a**–**d**). From left to right: the first column shows the input images after pre-processing; the second and third columns show the segmentation results using "intensity + texture" features and the adaptive threshold edge detection algorithm, respectively; the fourth column shows the results of our method.

**Figure 13.** Speed of different methods in the complex scenes.

The proposed algorithm showed robust performance in complex scenes, achieving the highest *F<sub>β</sub>* and the lowest time cost among the compared methods. These results indicate that the proposed algorithm is effective and segments river images better than the other two methods.

The method using "intensity + texture" features was prone to false detection. As shown in Figure 12, some pixels on the riverside were also detected as water, because some parts of the riverside had features similar to the designed ones. A method that relies only on global image features can therefore be confused. The method based on edge detection was likely to miss the shadowed part of the water region because of the strong edge of a clear shadow: it could not distinguish whether a detected edge was the riverbank or some other edge, which resulted in mistakes. In contrast, the improved LBP and *H* variance features designed in this study are local features based on the water surface reflection mechanism, and thus match the characteristics of water pixels closely. Such features can describe not only the ordinary water part of the image but also parts with a complex appearance, such as strong light and cast shadow. To illustrate this, the result of each step of our algorithm is shown in Figure 14.

(**a**) river scene with clear reflection

(**b**) river scene with a covered object (a boat) on river surface

(**c**) river scene with intense sunlight reflection

(**d**) river scene with large area shadow

**Figure 14.** Performance of the proposed method in complex scenes (**a**–**d**). From left to right: the first column shows the images after pre-processing. In the second column, the blue squares represent the water region detected with the improved LBP feature. In the third column, the green squares represent the detection result of the *H*-variance feature, developed from the second-column images. The fourth column shows the binary masks of the detection results after the designed morphological operation. The fifth column shows the final segmentation results, with the river region indicated by a translucent blue overlay.

#### *3.4. Discussion of Patch Size*

The patch size, that is, the size of each detection window in the image, is the basic unit of feature extraction in our algorithm. In theory, feature extraction on a smaller patch is faster, but the total number of feature calculations increases; a larger patch contains more pixels, which may include negative samples (non-water pixels) that impair the decision of the segmentation algorithm. Figure 15 shows the segmentation results obtained with different patch sizes.

**Figure 15.** Segmentation results under different patch sizes from 6 × 6 to 36 × 36 pixels used in the proposed algorithm.

The *F<sub>β</sub>* and speed under different patch sizes are shown in Figure 16. As the patch size increased, *F<sub>β</sub>* (*β* = 1.5, see Equation (20)) decreased and the time cost fell. After weighing segmentation performance against time consumed, a 6 × 6 patch size is usually adopted in practice.

**Figure 16.** Performance of the proposed algorithm with different patch sizes. (**a**) *PA*, *MIoU* and *F<sub>β</sub>* under different patch sizes. (**b**) Speed under different patch sizes.

#### *3.5. Discussion of Structuring Element*

The size and shape of the structuring elements affected the final segmentation result. Some tests were performed on different resolution images using different sizes of structuring elements from 1/5 to 1/30 of the input image size. Two examples with criteria measuring the segmentation performance were shown in Figure 17.

**Figure 17.** Performance under different sizes of structuring elements in the proposed algorithm. (**a**) An example of simple river scene; (**b**) An example of complex river scene.

As shown in the results, once the size of the structuring element grew larger than 1/15 of the input image size, the segmentation performance changed little. Based on further experiments on the dataset, the size can be set to 1/15 of the size of the input image, at which the algorithm was usually effective and reliable.

#### **4. Conclusions**

In this study, we focus on the image segmentation of outdoor river scenes. To solve the problem that current methods often missed detection and made false segmentations when applied to complex river scenes, this study proposed a novel segmentation method based on a reflection mechanism of the water surface. An improved LBP feature descriptor was designed for water detection and *H* variance was introduced to detect the shadow area of the water's surface. Morphological operation with multiple dilation was employed to eliminate pseudo-water patches wrongly detected by the proposed algorithm and to obtain the largest connected domain in the image as water region. The experiments were performed in simple and complex river scenes respectively where the proposed method was compared with two other river segmentation methods. The results showed the proposed method took the least time and had better and robust performance in both simple and complex river scenes.

At present, the proposed algorithm has only been proven to be suitable for segmenting water parts in river images. Since the algorithm is designed based on the reflection mechanism of the water surface, it remains to be further studied whether it is effective for other types of images. The design ideas of the proposed algorithm may be helpful to other segmentation algorithms.

In the future, research can be conducted on anomaly detection of water surfaces based on the proposed method. This study is also important for unmanned surface vehicles (USVs) and river mapping.

**Author Contributions:** Conceptualization, J.Y.; Methodology, Y.L.; Data curation, Y.Z.; Supervision, P.H., J.Y., G.Z. and D.H.; Validation, W.X.; Writing—original draft, Y.L.; Writing—review & editing, P.H., J.Y. and D.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was funded by the Fundamental Research Funds for the Central Universities (No.2019QNA5015), the National Natural Science Foundation of China (No. 61803333, 61573313), the Key Technology Research and Development Program of Zhejiang Province (No.2015C03G2010034), and the National Key R&D Program of China (No.2017YFC1403801).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


Autonomous Systems, Marina del Rey, CA, USA, 25–27 March 2002; Gini, M., Shen, W.-M., Torras, C., Yuasa, H., Eds.; IOS Press: Amsterdam, The Netherlands, 2002; pp. 124–133.


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Spectrogram Classification Using Dissimilarity Space**

#### **Loris Nanni <sup>1,\*</sup>, Andrea Rigo <sup>1</sup>, Alessandra Lumini <sup>2</sup> and Sheryl Brahnam <sup>3</sup>**


Received: 19 May 2020; Accepted: 9 June 2020; Published: 17 June 2020

**Abstract:** In this work, we combine a Siamese neural network and different clustering techniques to generate a dissimilarity space that is then used to train an SVM for automated animal audio classification. The animal audio datasets used are (i) birds and (ii) cat sounds, which are freely available. We exploit different clustering methods to reduce the spectrograms in the dataset to a number of centroids that are used to generate the dissimilarity space through the Siamese network. Once computed, we use the dissimilarity space to generate a vector space representation of each pattern, which is then fed into a support vector machine (SVM) to classify a spectrogram by its dissimilarity vector. Our study shows that the proposed approach based on dissimilarity space performs well on both classification problems without ad-hoc optimization of the clustering methods. Moreover, results show that the fusion of CNN-based approaches applied to the animal audio classification problem works better than the stand-alone CNNs.

**Keywords:** audio classification; dissimilarity space; siamese network; ensemble of classifiers; pattern recognition; animal audio

#### **1. Introduction**

Sound classification and recognition have been applied in different domains, e.g., speech recognition [1], music classification [2], environmental sound recognition, and biometric identification [3]. Traditionally, in pattern recognition problems, features have been extracted from the actual audio traces (e.g., Statistical Spectrum Descriptor and Rhythm Histogram [4]). However, by replacing audio traces by their visual representation, image classification techniques can be used to extract features on sound classification problems. The most commonly used visual representation of audio traces involves the display of their frequency spectrum as they vary in time, as in spectrograms [5] and Mel-frequency Cepstral Coefficients spectrograms [6]. A spectrogram can be described as a graph with two dimensions (time and frequency) plus a third dimension in terms of pixel intensity [7] that represents the signal amplitude in a specific frequency at a particular time step. Costa et al. [8,9] applied several classification and texture analysis techniques to music genre classification using such a method. In [9], the authors extracted grey level co-occurrence matrices (GLCMs) [10] from spectrograms, while in [8] they used the local binary pattern (LBP) [11], which is a popular texture descriptor. In [12], two other feature descriptors were extracted from audio images: local phase quantization (LPQ) and Gabor filters [13]. In 2017, Nanni et al. [2] demonstrated on multiple audio datasets how the fusion of acoustic features extracted from audio traces using state-of-the-art texture descriptors greatly improves the accuracy of acoustic and visual feature-based systems.

When deep learning became popular and Graphic Processing Units (GPUs) became more powerful at accessible costs, traditional pattern recognition changed, and attention focused even more on visual representations of acoustic traces. In the traditional machine learning framework, the optimization of the feature extraction step plays a key role, especially with the evolution of handcrafted features, which minimize the distance between patterns of the same class in the feature space while simultaneously attempting to maximize their distance from the patterns of other classes. Since deep classifiers learn the best features for describing patterns during the training process, these engineered features have diminished in significance, playing in the deep framework more of a supporting role when combined with features extracted from visual representations of acoustic traces that the deep classifiers determine are most informative. Another reason for the growing popularity of representing audio as images is the fact that the convolutional neural network (CNN), one of the most famous deep classifiers, requires images for its input. In their study, Humphrey and Bello [14,15] explored CNNs as an alternative approach to music classification problems, establishing the state-of-the-art in automatic chord detection and recognition. Nakashika et al. [16] converted spectrograms into GLCM maps to train CNNs for music genre classification, and Costa et al. [17] performed better than the state-of-the-art on the LMD dataset by fusing canonical approaches, e.g., LBP-trained SVMs with CNNs. Only a few studies, however, have focused on making these processes that were designed for image classification more specific for sound image recognition.
In their study, Sigtia and Dixon [18] focused on adjusting CNN parameters and structures and showed how using Rectified Linear Units (ReLu) instead of stochastic gradient descent with the Hessian Free optimization and sigmoid units reduced training time. Wang et al. [19] presented an innovative CNN, which they named a *sparse coding CNN*, for sound event recognition and retrieval, which, when evaluated under noisy and clean conditions, achieved competitive and sometimes better performance than the majority of other approaches. In Oramas [20], a hybrid approach was presented that combined diverse modalities (album cover images, reviews, and audio tracks) for multi-label music genre classification by applying deep learning techniques appropriate for each modality, an approach that outperformed the single-modality methods. Finally, it should be mentioned that many methods in machine learning are also proposed for the human voice classification task: emotion recognition [21], English accent classification, and gender classification [22], to name a few.

Because deep classifiers have produced a patent improvement in music classification, researchers have begun to apply deep learning approaches to other sound recognition tasks, such as biodiversity assessment. Precise sound recognition systems can be of crucial importance in assessing and handling environmental threats like animal species loss and climate changes affecting wildlife fauna [23]. Birds, for instance, have been acknowledged as an indicator species for ecological research, and their monitoring has become increasingly important for biodiversity preservation [23], especially considering the minimal impact video and audio acquisition has on ecosystems. To date, many datasets are available to develop classifiers to identify and monitor different species, such as birds [24,25], whales [26], frogs [24], bats [25], and cats [27]. For instance, both Cap et al. [28] and Salamon et al. [29] have investigated the fusion of CNNs with other methods to classify animals. The former study combined CNNs with handcrafted features to classify marine animals [30] using the Fish and MBARI benthic animal dataset [31], while the latter fused deep learning with shallow learning for bird species identification based on 5428 bird flight calls from 43 species. In both cases, the fusion of CNNs with other canonical techniques outperformed the single approaches.

Existing approaches for animal audio classification can roughly be classified into two categories: fingerprinting and CNN approaches. Fingerprinting [32] relies on the compact representation of audio traces so that each one can be efficiently matched against other audio clips to compare for similarity and dissimilarity [33]. An example of audio fingerprinting by CNN is shown in [34], where the authors used a Siamese neural network to produce semantic representations of the audio traces. However, fingerprinting is useful only in finding an exact match; the problem addressed in this work involves audio classification. As already noted, CNN-based approaches [35,36] train networks for animal audio classification starting from an image representation of the audio signal. Unfortunately, CNNs require a large number of training examples to be effective (larger than available in most animal audio datasets) and cannot generalize to new classes without retraining the network. The objective of this work is to solve these issues by proposing an approach based on Dissimilarity Spaces. Recently, Agrawal [37] proposed an approach that learns a distance model by training a Siamese neural network directly on dissimilarity values for brain image classification, and in [38] an approach is proposed for online signature verification using a Siamese neural network and a contrastive loss function. In the latter work, the authors claim that the main advantage a Siamese network offers over a canonical CNN is the ability to generalize: the Siamese network approach they developed was shown to verify the authenticity of the signature of a new user without being trained on any examples from this user.

In this work, the dissimilarity space is created using a Siamese Neural Network (SNN) trained on the entire training set to define a distance function among the samples. The training phase for the SNN is aimed at maximizing the distance between patterns of different classes; the testing phase of the SNN is used to compare two spectrograms to obtain a measure of their dissimilarity. In theory, all the training samples can be selected as centroids of the dissimilarity space. Dimensionality reduction is obtained by selecting a smaller number (*k*) of prototypes via a clustering approach. The dissimilarity space is the space where each spectrogram is represented by its distance to each centroid/prototype: in this space, the SNN is used to compare the spectrogram to every centroid, obtaining the spectrogram's dissimilarity vector, which is the final descriptor. The classification task is performed by a support vector machine (SVM) trained using the dissimilarity descriptors generated from the training samples. The proposed system is evaluated on two different datasets for animal audio classification: domestic cat sounds [27] and bird sounds [23]. Results for the different clustering methods and different values of the hyperparameter (*k*) are reported.

In addition, SVMs trained on different dissimilarity spaces (by changing the value of *k*) are combined by sum rule into an ensemble, whose performance is compared with (i) some canonical CNN approaches and (ii) the fusion of the SVMs and the CNNs. Experiments demonstrate for the first time that the use of dissimilarity spaces based on SNN is a feasible representation for image data and can, when combined with a general purpose classifier, achieve high classification performance. Because the descriptors obtained in the dissimilarity space show high diversity with respect to the representations based on CNNs, their fusion can be exploited in an ensemble, as demonstrated by the high classification accuracy obtained by the fusion of CNNs with our approach. The MATLAB code used in this study is freely available at https://github.com/LorisNanni.

#### **2. Proposed Approach**

The proposed method for spectrogram classification using dissimilarity space is based on several steps, which are schematized in Figure 1. This figure is followed by the pseudo-code for each step (Algorithms 1 and 2). In order to define a dissimilarity space, it is necessary to select a distance measure and a set of prototypes in the training phase. The distance measure *d*(*x*, *y*) is learned by means of an SNN trained to maximize the similarity between pairs of spectrograms in the same class, while minimizing the similarity for pairs in different classes. The set of prototypes *P* = {*p*<sub>1</sub>, ..., *p<sub>k</sub>*} is obtained as the *k* centroids of the clusters generated by a supervised clustering procedure. The final step represents each training sample *x* in the dissimilarity space by a feature vector *f* ∈ ℝ<sup>*k*</sup>, where each component *f<sub>i</sub>* is the distance between *x* and the prototype *p<sub>i</sub>*: *f<sub>i</sub>* = *d*(*x*, *p<sub>i</sub>*). These feature vectors are used to train an SVM for the final classification task. In the testing phase, each unlabeled spectrogram is first represented in the dissimilarity space by calculating its distance to all the prototypes, then the resulting feature vector is classified by the SVM.

**Figure 1.** Proposed approach scheme.

#### **Algorithm 1** Training phase

**Input:** Training images (*imgsTrain*), training labels (*labelTrain*), the number of training iterations (*trainIterations*), batch size (*trainBatchSize*), number of centroids (*k*), and the clustering technique (*type*).

**Output:** Trained SNN (*tSNN*), set of centroids (*P*), and trained SVM (*tSVM*).

1: *tSNN* ← TRAINSIAMESE(*imgsTrain*, *labelTrain*, *trainIterations*, *trainBatchSize*)


**Algorithm 2** Testing phase

**Input:** Test images (*imgsTest*), trained SNN (*tSNN*), set of centroids (*P*), trained SVM (*tSVM*).

**Output:** Predicted test labels (*labelTest*).

1: *F* ← GETDISSSPACEPROJECTION(*imgsTest*, *P*, *tSNN*)

2: *labelTest* ← PREDICTSVM(*F*, *tSVM*)

Each of the main functions used in the pseudo-code is described below.

#### *2.1. Siamese Neural Network Training*

The SNN, described in more detail in Section 3, is trained to compare a pair of spectrograms by returning a measure of their similarity. Algorithm 3 presents the pseudocode for this phase and corresponds to step 1 of Algorithm 1. The SNN architecture is defined in steps 2 and 3 of Algorithm 3. Steps 5–8 are repeated for each training iteration. Step 5 randomly extracts *batchSize* spectrogram pairs from the training set using the function GETSIAMESEBATCH. Step 6 feeds the pairs to the network and computes the loss and gradients for gradient descent. Steps 7 and 8 use the gradients and loss to update the weights of the fully connected layer and the twin subnetworks.

**Algorithm 3** Siamese training pseudocode


Note: in the case where the SNN fails to converge on the training set, training is rerun.
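The pair-sampling step can be illustrated with a short Python sketch. This is only an assumed reading of GETSIAMESEBATCH, whose body is not listed here: the function name `get_siamese_batch`, the alternation between same-class and different-class pairs, and the toy labels are all hypothetical. The pair label follows the convention used for the loss in Section 3 (1 for the same class, 0 otherwise).

```python
import random

def get_siamese_batch(labels, batch_size, rng):
    """Sample index pairs; the pair label is 1 for same class, 0 otherwise.

    Assumption: same-class and different-class pairs alternate so that the
    batch is balanced. The sampling scheme of the original code may differ.
    """
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    classes = sorted(by_class)
    pairs = []
    for n in range(batch_size):
        if n % 2 == 0:
            # Same-class pair: needs a class with at least two samples.
            c = rng.choice([c for c in classes if len(by_class[c]) > 1])
            i, j = rng.sample(by_class[c], 2)
            pairs.append((i, j, 1))
        else:
            # Different-class pair: one sample from each of two classes.
            c1, c2 = rng.sample(classes, 2)
            pairs.append((rng.choice(by_class[c1]), rng.choice(by_class[c2]), 0))
    return pairs

labels = ["cat", "cat", "bird", "bird", "bird"]   # toy labels
batch = get_siamese_batch(labels, batch_size=6, rng=random.Random(0))
```

Each element of `batch` is a triple (index of first spectrogram, index of second spectrogram, pair label) that would be fed to the twin subnetworks.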

#### *2.2. Prototype Selection*

In this phase, *k* prototypes are extracted from the training set. In theory, every spectrogram in the training set could be selected as a prototype, but this would be too resource-intensive and the dimensionality of the generated dissimilarity vectors would be too high. A better alternative is to employ clustering techniques to compute a fixed number of centroids for each class (*k* in total). Clustering significantly reduces the dimension of the resulting dissimilarity space and thus makes the process more viable. Algorithm 4 presents the pseudo-code for prototype selection, which offers a choice among four clustering procedures, each applied separately to the training samples belonging to each class.

#### **Algorithm 4** Clustering pseudocode

**Input:** Training images (*imgsTrain*), training labels (*labelTrain*), number of clusters (*k*), and clustering technique (*type*).

**Output:** Centroids *P*.

```
function CLUSTERING(imgsTrain, labelTrain, k, type)
    numClasses ← number of classes from labelTrain
    kc ← k/numClasses
    P ← ∅
    for i ← 1 to numClasses do
        imgs ← images of the class i from imgsTrain
        switch type do
            case "k-means"      Pi ← KMEANS(imgs, kc)
            case "k-medoids"    Pi ← KMEDOIDS(imgs, kc)
            case "hierarchical" Pi ← HIERARCHICAL(imgs, kc)
            case "spectral"     Pi ← SPECTRAL(imgs, kc)
        P ← P ∪ Pi
    end for
    return P
end function
```
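The per-class structure of Algorithm 4 can be mirrored in a few lines of Python. This is a minimal sketch under stated assumptions: `cluster_per_class` and `chunk_means` are hypothetical names, integer division is assumed for *kc*, and `chunk_means` is a deliberately trivial stand-in for KMEANS, KMEDOIDS, HIERARCHICAL or SPECTRAL, not one of the clustering techniques used in this work.

```python
def cluster_per_class(samples, labels, k, cluster_fn):
    """Run cluster_fn on each class separately and pool the prototypes."""
    classes = sorted(set(labels))
    kc = k // len(classes)              # prototypes per class (assumed integer division)
    prototypes = []
    for c in classes:
        members = [s for s, lab in zip(samples, labels) if lab == c]
        prototypes.extend(cluster_fn(members, kc))
    return prototypes

def chunk_means(points, kc):
    """Hypothetical stand-in for a clustering routine: means of kc sorted chunks."""
    points = sorted(points)
    size = len(points) // kc
    chunks = [points[i * size:(i + 1) * size] for i in range(kc - 1)]
    chunks.append(points[(kc - 1) * size:])      # last chunk takes the remainder
    return [[sum(col) / len(ch) for col in zip(*ch)] for ch in chunks]

samples = [[0.0], [1.0], [10.0], [11.0], [5.0], [6.0], [20.0], [21.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
P = cluster_per_class(samples, labels, k=4, cluster_fn=chunk_means)
```

Passing the clustering routine as an argument keeps the wrapper independent of the technique, which matches the role of the *type* switch in the pseudocode.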
#### *2.3. Projection in the Dissimilarity Space*

Existing classification methods learn to classify patterns using their feature space. In this work, patterns are represented in a dissimilarity space in which every pattern *x* is represented by its dissimilarity to a selected set of prototypes *P* = {*p*<sub>1</sub>, ..., *p<sub>k</sub>*} by a dissimilarity vector:

$$F(\mathbf{x}) = [d(\mathbf{x}, p_1), d(\mathbf{x}, p_2), \dots, d(\mathbf{x}, p_k)],\tag{1}$$

where the dissimilarity *d*(*x*, *y*) between patterns is obtained using a trained SNN. In order to project each image into the dissimilarity space ℝ<sup>*k*</sup>, Algorithm 5 compares each input image (stored in *X* in step 3) with the *k* centroids (stored in *P*) using the trained SNN *tSNN* with the PREDICTSIAMESE function (step 4). The resulting feature space *F* includes the projected features of all the input images.
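As a rough illustration of Equation (1), the following Python sketch projects a set of patterns into a dissimilarity space. The function `toy_distance` is a hypothetical stand-in (plain Euclidean distance) for the trained SNN, and the two-dimensional points stand in for spectrograms; neither is part of the method described here.

```python
import math

def toy_distance(x, p):
    """Hypothetical stand-in for the trained SNN: plain Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))

def project(patterns, prototypes, distance=toy_distance):
    """Return one k-dimensional dissimilarity vector per pattern (Equation (1))."""
    return [[distance(x, p) for p in prototypes] for x in patterns]

prototypes = [[0.0, 0.0], [3.0, 4.0]]    # k = 2 prototypes
patterns = [[0.0, 0.0], [3.0, 0.0]]
F = project(patterns, prototypes)
# F[0] is [0.0, 5.0]; F[1] is [3.0, 4.0]
```

The length of each descriptor equals the number of prototypes, which is why reducing *k* by clustering directly reduces the dimensionality seen by the SVM.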


#### *2.4. Support Vector Machine Training and Prediction*

A Support Vector Machine (SVM) is a supervised learning model which can be used to perform classification or regression. An SVM model represents each training example as a data point in space and is trained to construct one or more hyperplanes that divide the space in two, separating data points belonging to different classes (function TRAINSVM). The model will predict (function PREDICTSVM) the class of a new pattern mapped in the space according to the side of the hyperplane the data point falls into. The hyperplane found by an SVM is defined as follows:

$$D(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} - b, \tag{2}$$

where *D*(*x*) is the hyperplane, *x* is the data point vector, *w* is the hyperplane's normal vector, and the ratio *b*/||*w*|| is the hyperplane's distance from the origin. The optimal hyperplane is the one that maximizes the distance to the nearest data point of any class, defined as 2/||*w*||, which is also called the *margin*. The *i*-th point *x<sub>i</sub>* will be assigned to the first class when *D*(*x<sub>i</sub>*) ≥ +1 and to the second class when *D*(*x<sub>i</sub>*) ≤ −1. The points that lie on the margin line, defined by the equation *D*(*x<sub>i</sub>*) = ±1, completely describe the solution to the problem and are called *support vectors*. An example of an optimal hyperplane with highlighted support vectors is shown in Figure 2.
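A small numerical check may help to fix the quantities in Equation (2). In the Python sketch below, the values of `w` and `b` are purely illustrative (they are not fitted to any data), and the function names are hypothetical.

```python
import math

def decision(w, x, b):
    """D(x) = w . x - b, as in Equation (2)."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b

def margin(w):
    """Margin as defined in the text: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = [3.0, 4.0], 5.0                  # illustrative values, ||w|| = 5
d_pos = decision(w, [2.0, 0.0], b)      # 3*2 - 5 = +1: on the margin line of the first class
d_neg = decision(w, [0.0, 1.0], b)      # 4*1 - 5 = -1: on the margin line of the second class
m = margin(w)                           # 2 / 5
origin_distance = b / math.sqrt(sum(wi * wi for wi in w))   # b / ||w|| = 1
```

With these values the two points play the role of support vectors, since they satisfy *D*(*x*) = ±1 exactly.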

**Figure 2.** SVM's hyperplane.

Because SVMs use hyperplanes to discriminate data, they do not work well with data that is not linearly separable in its original space. This problem can be solved using kernel functions, which map data into a much higher dimensional space, presumably to make the separation easier in that space. To keep the computational complexity to an acceptable level, the kernel function of choice has to be computationally efficient.

Being binary classifiers, SVMs can only determine the separation surface between two classes of data; however, it is possible to apply SVMs to multi-class problems by training an ensemble of SVMs and combining them. In this work, the *One-Against-All* approach is used, where for each class an SVM is trained to discriminate between a given class and all the other classes put together. The pattern is then assigned to the class that gives the highest confidence score.
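The One-Against-All decision can be sketched as follows. The sketch assumes linear decision functions and uses the raw value of *D*(*x*) as the confidence score; both are simplifying assumptions, and the class names and weights are hypothetical.

```python
def one_against_all_predict(x, models):
    """models maps class name -> (w, b); return the class with the highest D(x)."""
    scores = {
        c: sum(wi * xi for wi, xi in zip(w, x)) - b
        for c, (w, b) in models.items()
    }
    return max(scores, key=scores.get), scores

# One binary model per class, each separating that class from all the others.
models = {
    "A": ([1.0, 0.0], 0.5),
    "B": ([0.0, 1.0], 0.5),
    "C": ([-1.0, -1.0], 0.0),
}
label, scores = one_against_all_predict([2.0, 1.0], models)
# scores: A -> 1.5, B -> 0.5, C -> -3.0, so the pattern is assigned to "A"
```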

#### **3. Siamese Neural Network**

The Siamese Neural Network (SNN) is a class of neural network architectures that contains two or more twins, i.e., sub-networks with the same parameters and weights. SNNs are used in tasks involving similarity or in identifying correlations between different entities. The SNN was first proposed by Bromley et al. [39] for performing signature verification. SNNs have since been used successfully in other application domains, such as face verification [40], image recognition [41], human fall detection [42], content-based audio representation [34], and sound search by vocal imitation [43]. The SNN architecture used in this work is similar to the one used in [43] and is shown in Figure 3.

**Figure 3.** Siamese Neural Network architecture.

As shown in Figure 3, the SNN used in this work is composed of five blocks:

• *Two identical twin subnetworks*

The twin subnetworks in our SNN are two Convolutional Neural Networks composed of 13 layers, as listed in Table 1.


**Table 1.** Siamese subnetworks layers.

These subnetworks learn the features best representing the spectrograms in the input (*X*1 and *X*2), returning a 4096-dimensional feature vector for each (*F*1 and *F*2). The subnetworks share parameters and weights which are mirrored during the training.

• *Subtract block*

The output vectors of the subnetworks are subtracted, resulting in a feature vector *Y* representing the features in which the images differ:

$$Y = |F1 - F2|\tag{3}$$

• *Fully Connected Layer*

As in [37], the Fully Connected Layer (FCL) learns the distance model to calculate the dissimilarity. The output vector of the subtract block is fed to the FCL which returns a dissimilarity value for the pair of spectrograms in the input.

• *Sigmoid*

The sigmoid function is a class of mathematical real functions having a characteristic S-shaped curve. We apply the sigmoid to the dissimilarity value returned by the FCL to convert it to a probability value in the range [0, 1], using the standard logistic function:

$$S(\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{x}}} \tag{4}$$

• *Binary Cross Entropy*

The Binary Cross Entropy (BCE) is a popular loss function, which, given the prediction of the model and the correct observation label (in our case, 1 if the two spectrograms belong to the same class, 0 otherwise) returns a measure of the performance of the model. Loss functions are used by learning algorithms to train the network by adjusting the weights. BCE is applied to the probability obtained from the sigmoid and computes the gradients of the loss function with respect to the weights of the network in order to adjust them. In a two-class problem, BCE can be calculated as:

$$BCE(y, p) = -(y \log(p) + (1 - y) \log(1 - p)),\tag{5}$$

where *y* is the binary value that indicates whether the class label *c* is correct for the observation *o*, *p* is the predicted probability that observation *o* is of class *c*, and log is the natural logarithm.
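The chain formed by the subtract block, the fully connected layer, the sigmoid and the loss (Equations (3)–(5)) can be traced numerically. In this Python sketch the three-dimensional feature vectors, the weights and the bias are hypothetical toy values (the subnetworks described above return 4096-dimensional vectors), and the fully connected layer is reduced to a single weighted sum.

```python
import math

def sigmoid(z):
    """Standard logistic function, Equation (4)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, p):
    """Binary cross entropy for one pair, Equation (5)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def snn_head(f1, f2, weights, bias):
    """Subtract block -> fully connected layer (one output) -> sigmoid."""
    y_vec = [abs(a - b) for a, b in zip(f1, f2)]              # Equation (3)
    z = sum(w * v for w, v in zip(weights, y_vec)) + bias     # assumed single-output FCL
    return sigmoid(z)

f1, f2 = [0.2, 0.9, 0.4], [0.2, 0.1, 0.4]     # toy subnetwork outputs
p = snn_head(f1, f2, weights=[1.0, -2.0, 0.5], bias=0.0)
loss_if_same_class = bce(1, p)        # label 1: the pair belongs to the same class
loss_if_diff_class = bce(0, p)        # label 0: the pair belongs to different classes
```

Here the network output is below 0.5, so the loss is larger when the pair is labeled as same-class than when it is labeled as different-class.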

#### **4. Clustering**

Clustering is the task of organizing data in groups (Figure 4) so that patterns in the same cluster are more similar to each other than they are to patterns belonging to other clusters. Clustering is often used to find natural clusters in unlabeled data. Some clustering techniques calculate centroids during the process. A centroid is the mean vector of all the patterns in a cluster. Because it is a mean vector, it contains the most characterizing features of a cluster's patterns. Centroids are computed to reduce the dissimilarity space size without losing too much information. The greater the number of centroids used for each class, the more information that is retained. In this work, samples are divided into classes before clustering, and the clustering procedure is applied to each class separately. The remainder of this section describes the four clustering techniques used in this study.

**Figure 4.** A sample of clusters found from unlabeled data: on the left the original 2D data, on the right clustered data, where different colors denote different clusters.

#### *4.1. K-Means*

K-means is a popular clustering algorithm that partitions a set of patterns into *k* clusters by assigning each observation to the cluster with the nearest centroid, or mean vector. There are several versions of this algorithm. In this study, the default implementation (with the Euclidean distance metric) in the MATLAB Statistics and Machine Learning Toolbox was applied. The standard k-means algorithm cycles through the following steps:
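A common formulation of the standard iteration (often called Lloyd's algorithm) alternates an assignment step and an update step until no pattern changes cluster. The Python sketch below follows that textbook formulation under stated assumptions; it is not the MATLAB implementation used in this study, and the function name and toy data are hypothetical.

```python
def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's iteration: alternate assignment and centroid update until stable."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [list(c) for c in init_centroids]
    assign = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assign = [
            min(range(len(centroids)), key=lambda c: sqdist(p, centroids[c]))
            for p in points
        ]
        if new_assign == assign:
            break                        # no point changed cluster: converged
        assign = new_assign
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:                  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids, assign = kmeans(points, init_centroids=[[0.0, 0.0], [10.0, 0.0]])
```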


The k-means++ variation [44] employs a heuristic to find the initial centroids:
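In its usual formulation, the k-means++ heuristic draws the first centroid uniformly at random and each further centroid with probability proportional to the squared distance from the nearest centroid already chosen. A minimal Python sketch of that seeding rule, with a hypothetical function name and toy data, is:

```python
import random

def kmeanspp_init(points, k, rng):
    """Choose k initial centroids by squared-distance-weighted sampling."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [rng.choice(points)]     # first centroid: uniform draw
    while len(centroids) < k:
        # Squared distance of every point to its nearest chosen centroid.
        d2 = [min(sqdist(p, c) for c in centroids) for p in points]
        # Already chosen points have weight 0 and are not drawn again.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
seeds = kmeanspp_init(points, k=3, rng=random.Random(1))
```

The sketch assumes that the data contain at least *k* distinct points; otherwise all weights become zero and the weighted draw fails.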


#### *4.2. K-Medoids*

K-medoids is a clustering technique very similar to k-means. It partitions a set of observations into *k* clusters by minimizing the sum of distances between a pattern and the center of that pattern's cluster. The main difference between k-means and k-medoids is that, in the first case, the center of a cluster is its centroid, or mean, whereas, in the latter case, the center is a member, or medoid, of the cluster. A medoid is an observation in a cluster whose sum of distances from the other observations within the cluster is minimal. The basic algorithm for K-medoids loops through the following three steps:
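The difference between a medoid and a centroid can be seen on a toy cluster. The Python sketch below only illustrates the definition of a medoid given above (the member with the minimal sum of distances to the other members); it is not a full k-medoids implementation, and the data are hypothetical.

```python
import math

def medoid(cluster):
    """Return the member whose summed distance to all members is minimal."""
    return min(cluster, key=lambda m: sum(math.dist(m, o) for o in cluster))

cluster = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]]
m = medoid(cluster)                                           # an actual observation
mean = [sum(col) / len(cluster) for col in zip(*cluster)]     # the k-means style centroid
# m is [2.0, 0.0]; mean is [3.2, 0.0], which is not a member of the cluster
```

Because the medoid is always an observation, it is less affected by the outlying point at 10 than the mean is.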


#### *4.3. Hierarchical*

Hierarchical clustering is a clustering technique that groups data by building a hierarchy of clusters. The hierarchy tree that is obtained is divided into *n* levels chosen for the application at hand. There are two main categories of hierarchical clustering:


In this work, the default MATLAB implementation of hierarchical clustering is used, which is the agglomerative type. The MATLAB algorithm loops through the following three steps:


After applying this algorithm, centroids, as the mean vectors of each cluster, are computed.

#### *4.4. Spectral*

The spectral clustering technique splits data into groups using the data's undirected similarity graph represented by a similarity matrix (also called an *adjacency matrix*). In the similarity graph, every node is an observation, and two nodes are connected by an edge if their similarity is larger than a certain threshold, which is often 0. The algorithm uses four mathematical expressions:


$$D_g(i, i) = \sum_{j} m_{ij},$$

where *D<sub>g</sub>* is the degree matrix, and *m<sub>ij</sub>* is a value of the similarity matrix.

• *Laplacian Matrix*: another way of representing the similarity graph that is defined as

$$L = D_g - M.$$
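The degree and Laplacian matrices can be computed directly from a similarity matrix. The Python sketch below uses a hypothetical 3 × 3 similarity matrix and plain nested lists; it covers only these two definitions, not the remaining steps of spectral clustering.

```python
def degree_matrix(M):
    """Diagonal matrix whose (i, i) entry is the sum of row i of M."""
    n = len(M)
    return [[sum(M[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]

def laplacian(M):
    """Unnormalized graph Laplacian: degree matrix minus similarity matrix."""
    D = degree_matrix(M)
    n = len(M)
    return [[D[i][j] - M[i][j] for j in range(n)] for i in range(n)]

M = [[0.0, 1.0, 0.5],        # toy symmetric similarity matrix
     [1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
L = laplacian(M)
```

Every row of the resulting matrix sums to zero, a property that follows directly from the two definitions.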

Here are the steps required by the spectral algorithm:


#### **5. Experimental Results**

The approach proposed in this paper is tested, along with some comparison canonical approaches, using a stratified ten-fold cross validation protocol and the classification accuracy as the performance indicator. Tests were performed on two datasets:


In Tables 2 and 3, the performance of the four tested clustering algorithms is reported using different values of *kc* (i.e., the number of clusters per class). As a baseline for comparison, the classification accuracy is also reported for the following well-known CNN models, each fine-tuned on the problem (for 30 epochs, using a batch size of 30, and a learning rate of 0.0001, no freezing):


Moreover, in Tables 2 and 3, the accuracy obtained by the following fusion approaches is reported:



**Table 2.** Classification accuracy on the BIRDZ dataset.


**Table 3.** Classification accuracy on the CAT dataset.

From the results reported in Tables 2 and 3, the following conclusions can be drawn:


The proposed approach based on the representation of animal sound in a dissimilarity space has two main advantages: (1) it produces a compact representation of the signal (ranging from 15 to 60, depending on the number of clusters for the single space, to 150 for the KAll ensemble); (2) it generates a high diversity of classification results with respect to the baseline CNNs, which can be exploited to improve the performance in an ensemble method (i.e., ALL+GoogleNet).
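The sum rule used to build the ensembles adds the per-class scores of the individual classifiers and selects the class with the largest total. The Python sketch below assumes that the scores are already on comparable scales; the function name and the score values are hypothetical and are not results from this study.

```python
def sum_rule(score_lists):
    """Add per-class scores from several classifiers; return totals and the winner."""
    fused = [sum(scores) for scores in zip(*score_lists)]
    winner = max(range(len(fused)), key=fused.__getitem__)
    return fused, winner

svm_scores = [0.2, 0.5, 0.3]     # illustrative scores for three classes
cnn_scores = [0.7, 0.2, 0.1]
fused, winner = sum_rule([svm_scores, cnn_scores])
# Taken alone, the first classifier prefers class 1 and the second prefers class 0;
# the fused totals are approximately [0.9, 0.7, 0.4], so class 0 is selected.
```

Fusion helps most when the classifiers make different errors, which is the sense in which the diversity noted above is useful.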

In Table 4, the ensembles proposed in this work are shown to achieve a performance on the two datasets that is similar to some of the state-of-the-art approaches reported in the literature. Two results are taken from [27], and are labeled [27] and [27]-*CNN*.

Unfortunately, most published papers in the field of acoustic animal classification focus only on a single dataset. The authors of this paper are aware that evaluating the proposed approach on two different datasets instead of focusing on just one limits the strength of the conclusions drawn. Be that as it may, the experiments reported here prove the robustness of the proposed approach, which obtains good classification accuracy on two different problems without any ad-hoc parameter optimization and according to a clear and unambiguous testing protocol. As a result, the performances reported in this paper can be used for baseline comparisons with other audio classification methods developed in the future.


**Table 4.** Literature results.

\* Note that the results in [51] are based on a feature selection approach where the number of selected features is a hyperparameter selected on that dataset; the approach presented here has no hyperparameters selected on a given dataset.

#### **6. Conclusions**

In this work, a method using dissimilarity space is presented that achieves competitive results in automated audio classification of animal sounds (bird and cat sounds). Different types of clustering techniques to obtain centroids for dissimilarity space generation were tested and compared. A set of SVMs was trained on the dissimilarity spaces generated using four clustering techniques and different numbers of centroids. These SVMs were then combined by sum rule to obtain a high performing ensemble.

Moreover, it is shown that the proposed ensemble of SVMs can be fused with other state-of-the-art approaches to improve classification accuracy: the fusions improved performance on the two audio classification problems and outperformed the standalone approaches.

In the future, this study will be further developed by including other sound classification problems, e.g., those cited in [26,37], in order to obtain a more comprehensive validation of the proposed approach. The plan is also to test the proposed method on some image classification problems using additional supervised and unsupervised clustering techniques.

**Author Contributions:** L.N. conceived of the presented idea. A.R. carried out the implementation. L.N. and A.L. performed the experiments. A.L. and S.B. wrote the manuscript with input from all authors. S.B. provided some resources. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors thank NVIDIA Corporation for supporting this work by donating a Titan Xp GPU.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Texture Segmentation: An Objective Comparison between Five Traditional Algorithms and a Deep-Learning U-Net Architecture**

#### **Cefa Karabağ <sup>1</sup>, Jo Verhoeven <sup>2,3</sup>, Naomi Rachel Miller <sup>2</sup> and Constantino Carlos Reyes-Aldasoro <sup>1,\*</sup>**


Received: 30 July 2019; Accepted: 10 September 2019; Published: 17 September 2019

**Abstract:** This paper compares a series of traditional and deep learning methodologies for the segmentation of textures. Six well-known texture composites first published by Randen and Husøy were used to compare traditional segmentation techniques (co-occurrence, filtering, local binary patterns, watershed, multiresolution sub-band filtering) against a deep-learning approach based on the U-Net architecture. For the latter, the effects of depth of the network, number of epochs and different optimisation algorithms were investigated. Overall, the best results were provided by the deep-learning approach. However, the best results were distributed within the parameters, and many configurations provided results well below the traditional techniques.

**Keywords:** texture; segmentation; deep learning

#### **1. Introduction**

Texture, and more specifically textural characteristics in images, has been widely studied in the past decades as texture is one of the most important features present in images and can be used for feature extraction [1–8] and classification and segmentation [9–14]. The areas of study where texture is present range from crystallographic texture [15], stratigraphy [16,17], food science of potatoes [18] or apples [19], patterned fabrics [20] to natural stone industry [21]. In medical imaging, there is a large volume of research which exploits the use of texture for different purposes, like segmentation or classification in most acquisition modalities like magnetic resonance imaging (MRI) [22–26], ultrasound [27,28], computed tomography (CT) [29–31], microscopy [32,33] and histology [34]. There are numerous approaches to texture: Haralick's co-occurrence matrix [4,5] on the spatial domain, Gabor filters [35–37] and ordered pyramids [8] on the spectral domain, wavelets [38,39] or Markov random fields [3,40].

In recent years, advances in artificial intelligence have revolutionised image processing tasks. Several deep learning approaches [41–43] have achieved outstanding results in difficult tasks such as those of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [44]. Convolutional Neural Networks (CNNs) are well suited to analyse textures as their repetitive patterns can be learned and identified by filter banks [45]. The U-Net architecture proposed by Ronneberger [46] has become a very widely used tool for segmentation and analysis, reaching thousands of citations in the few years since it was published. U-Nets have been used widely, for instance, for road extraction [47], singing voice separation [48], automatic brain tumour detection and segmentation [49] and cell counting, detection, and morphometry [50]. The success of these deep learning approaches in very different areas invites their application to texture analysis.

In this work, a U-Net architecture for the segmentation of textures is implemented and objectively compared against several popular traditional segmentation strategies. The traditional algorithms (co-occurrence matrices [5], watershed [51], local binary patterns (LBP) [52,53], filtering [54] and multiresolution sub-band filtering (MSBF) [8]) were selected as these have been previously published using the texture composites proposed by Randen [55] and thus an objective numerical comparison is possible.

To perform an objective comparison, six well-known texture composites from the Brodatz [56] album, first published by Randen and Husøy [54], are segmented with U-Nets of different configurations and parameters and the results are compared against previously published results. The effects of the configuration of the networks, namely, number of epochs, depth of the network in the number of layers, and type of optimisation algorithm are assessed. All the programming was performed in MATLAB® (The MathWorks™, Natick, MA, USA) and the code is freely available through GitHub (https://github.com/reyesaldasoro/Texture-Segmentation).

#### **2. Materials and Methods**

#### *2.1. Texture Composite Images*

Six composite texture images were segmented in this work (Figure 1). The first five composites are images of 256 × 256 pixels and consist of five different textures whilst the last one is 512 × 512 pixels and is formed with 16 different textures. The masks with which these were formed are shown in Figure 2. It should be highlighted that these textures have been histogram equalised prior to the arrangement and thus they cannot be distinguished by the general intensity of each region. It is frequent that comparisons are made over textures that are not equalised (e.g., [57] Figure 3, [45] Figure 2) and thus the segmentation is not only based on the texture but the average intensity of the regions. Furthermore, whilst some textures are easy to distinguish, there are some that are quite challenging, for instance, the difference between the central and bottom regions in Figure 1c or the top left corners of Figure 1d,e.

**Figure 1.** *Cont*.

**Figure 1.** Six composite texture images. (**a-e**) Texture arrangements with five textures. (**f**) Texture arrangement with sixteen textures. Notice first, that individual textures have been histogram equalised and thus each region cannot be distinguished by the intensity, and second, some textures area easier to

**Figure 2.** (**a**) Mask corresponding to texture arrangements of Figs. 1(a-e). (**b**) Mask corresponding to

The training data in [54] is provided separately and is shown in Figure 3 for the first five composites and in Figure 4 for the last case. For the purpose of training the U-Nets, the training

distinguish (e.g., (a)) than others (e.g., (d)).

texture arrangements of Figure 1(f).

images were tessellated into sub-regions of 32 × 32 pixels each.

*2.2. Training data*

**Figure 1.** Six composite texture images. (**a**–**e**) Texture arrangements with five textures. (**f**) Texture arrangement with sixteen textures. Notice first, that individual textures have been histogram equalised and thus each region cannot be distinguished by the intensity, and second, some textures are easier to distinguish (e.g., (**a**)) than others (e.g., (**d**)).

**Figure 2.** (**Left**) Mask corresponding to texture arrangements of Figure 1a–e. (**Right**) Mask corresponding to texture arrangements of Figure 1f.

#### *2.2. Training Data*

The training data in [54] is provided separately and is shown in Figure 3 for the first five composites and in Figure 4 for the last case. For the purpose of training the U-Nets, the training images were tessellated into sub-regions of 32 × 32 pixels each.

Pairs of textures and labels were constructed simultaneously in the following way: two training images were selected. Sub-regions of each image were selected and for every pair of the sub-regions, half of each was selected and placed together so that a new 32 × 32 patch with both textures was created with a corresponding 32 × 32 patch with the classes. The patches were created with diagonal, vertical and horizontal pairs. The training images were traversed horizontally and vertically without overlap creating numerous training pairs. A montage of the texture pairs and labels corresponding to Figure 1a is illustrated in Figure 5. All pairs between classes were considered, i.e., 1-2, 1-3, 1-4, 1-5, 2-1, 2-3, . . . , 5-3, 5-4. In total, 2940 patches were created for the five composites with five textures and 35,280 were created for the composite with sixteen textures.
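The construction of the training pairs can be illustrated with a minimal Python sketch. The function names `tessellate` and `make_pair`, the choice of which half comes from which texture, and the exact shape of the diagonal split are assumptions made for illustration; the original work does not specify them, and this is not the authors' implementation.

```python
import numpy as np

def tessellate(image, size=32):
    """Split a 2D image into non-overlapping size x size tiles (row-major order)."""
    h, w = image.shape
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def make_pair(tile_a, tile_b, class_a, class_b, boundary):
    """Join part of tile_a with part of tile_b; return (patch, label patch)."""
    n = tile_a.shape[0]
    rows, cols = np.indices((n, n))
    if boundary == "vertical":      # assumed: left half from A, right half from B
        from_a = cols < n // 2
    elif boundary == "horizontal":  # assumed: top half from A, bottom half from B
        from_a = rows < n // 2
    elif boundary == "diagonal":    # assumed: on/above the main diagonal from A
        from_a = cols >= rows
    else:
        raise ValueError("boundary must be vertical, horizontal or diagonal")
    patch = np.where(from_a, tile_a, tile_b)
    labels = np.where(from_a, class_a, class_b).astype(np.uint8)
    return patch, labels

# All ordered class pairs for five textures: 1-2, 1-3, ..., 5-4.
ordered_pairs = [(a, b) for a in range(1, 6) for b in range(1, 6) if a != b]
```

Five classes give 20 ordered pairs and sixteen classes give 240. The totals quoted in the text are consistent with 147 patches per ordered pair (2940 = 20 × 147 and 35,280 = 240 × 147), although the source does not state this breakdown explicitly.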

**Figure 3.** Training images corresponding to the texture arrangements of Figure 1a–e.

**Figure 4.** Training images corresponding to the texture arrangements of Figure 1f.

**Figure 5.** Montages of the texture pairs created to train the deep learning networks. Training images shown in Figures 3 and 4 were tessellated and arranged in diagonal, vertical and horizontal pairs. (**a**) Texture pairs. (**b**) Labels. (**c**) Detail of the texture pairs. (**d**) Detail of the labels.

#### *2.3. Traditional Texture Segmentation Algorithms*

For this paper, we compared the results of the following texture segmentation algorithms: co-occurrence matrices [5], watershed [51], local binary patterns (LBP) [52,53], filtering [54] and multiresolution sub-band filtering (MSBF) [8] against a U-Net architecture [46].

The traditional algorithms have been thoroughly described in the literature; however, for completeness, a short explanation of how features are extracted with each algorithm will follow. For a discussion of traditional texture techniques, the reader is referred to any of the following reviews [58–60].

Co-occurrence matrices are constructed from a quantised version of a grey level image so that if an image is quantised to 8 levels, the co-occurrence matrix will have 8 rows and columns. The values of each location of the matrix will depend on the number of times that a pair of grey levels jointly occur at a neighbouring distance (e.g., 1 pixel away) with a certain orientation (e.g., horizontally). In this way, a co-occurrence matrix is able to measure local grey level dependence: textural coarseness and directionality. For example, in coarse images, the grey level of the pixels changes slightly with distance, while for fine textures the levels change rapidly. From this matrix, different features like entropy, uniformity, maximum probability, contrast, difference moment, inverse difference moment and correlation can be calculated [5]. Once the features have been calculated, classifiers can be applied directly, or further processing like the watershed transforms can be applied.
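As an illustration of the description above, the following Python sketch builds a co-occurrence matrix for one displacement and derives a few of the listed features. The function names, the assumption of an 8-bit input range and the particular feature formulas (standard textbook definitions) are choices made here and are not taken from [5].

```python
import numpy as np

def cooccurrence(image, levels=8, dr=0, dc=1):
    """Count joint occurrences of quantised grey levels at offset (dr, dc) >= 0."""
    img = np.asarray(image, dtype=np.float64)
    # Quantise to `levels` grey levels, assuming an 8-bit range [0, 255].
    q = np.minimum((img / 256.0 * levels).astype(int), levels - 1)
    h, w = q.shape
    ref = q[0:h - dr, 0:w - dc]   # reference pixels
    nbr = q[dr:h, dc:w]           # neighbours at the chosen displacement
    glcm = np.zeros((levels, levels), dtype=np.int64)
    np.add.at(glcm, (ref.ravel(), nbr.ravel()), 1)
    return glcm

def features(glcm):
    """Entropy, uniformity, maximum probability and contrast of a matrix."""
    p = glcm / glcm.sum()
    i, j = np.indices(p.shape)
    nz = p[p > 0]
    return {
        "entropy": float(-np.sum(nz * np.log2(nz))),
        "uniformity": float(np.sum(p ** 2)),
        "max_probability": float(p.max()),
        "contrast": float(np.sum((i - j) ** 2 * p)),
    }
```

With the default arguments the matrix has 8 rows and columns and counts horizontal neighbours one pixel apart, matching the example in the text. A fine texture such as a one-pixel checkerboard yields a high contrast, whereas a flat region yields a contrast of zero.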

Watershed transforms are based on a topographical analogy of a landscape. Should water fall in this landscape, it would find the path through which it could reach a region of minimum altitude, i.e., a basin, sometimes called lake or sea. For each point in the landscape (or pixel of the image) there is a path towards one and only one basin. Thus, the landscape can be partitioned into catchment basins or regions of influence of the regional minima and the boundaries between the basins (e.g., points of inflection) are called the watershed lines [61]. The watershed transform can be applied to features extracted from the co-occurrence matrix [51]. The basins produced can further be iteratively merged to segment textured regions.
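The idea that every pixel drains to exactly one basin can be sketched in Python with a simple steepest-descent labelling. This is a simplified illustration and not the algorithm of [61] or the method of [51]: it assumes an 8-connected neighbourhood, breaks ties by scan order, and treats every pixel without a strictly lower neighbour as its own minimum, so plateaus are not handled. The function name is hypothetical.

```python
import numpy as np

def catchment_basins(height):
    """Label each pixel with the basin reached by following the steepest descent."""
    height = np.asarray(height, dtype=float)
    h, w = height.shape
    labels = np.zeros((h, w), dtype=int)   # 0 means "not yet labelled"
    next_label = 1
    for r0 in range(h):
        for c0 in range(w):
            path = []
            r, c = r0, c0
            while labels[r, c] == 0:
                path.append((r, c))
                best = (r, c)
                # Search the 3 x 3 neighbourhood for the lowest pixel.
                for rr in range(max(r - 1, 0), min(r + 2, h)):
                    for cc in range(max(c - 1, 0), min(c + 2, w)):
                        if height[rr, cc] < height[best]:
                            best = (rr, cc)
                if best == (r, c):         # no lower neighbour: a new basin
                    labels[r, c] = next_label
                    next_label += 1
                    break
                r, c = best
            for p in path:                 # every pixel on the path joins that basin
                labels[p] = labels[r, c]
    return labels
```

The boundaries between differently labelled regions play the role of the watershed lines. In practice the many small basins produced on noisy feature images would need to be merged, as noted above.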

Local binary patterns (LBP) [52] explore the relations between neighbouring pixels. These methods concentrate on the relative intensity relations between the pixels in a small neighbourhood and not on their absolute intensity values or the spatial relationship of the whole data. The underlying assumption is that texture is not properly described by the Fourier spectrum and traditional frequency filters. The texture analysis is based on the relationship of the pixels of a 3 × 3 neighbourhood. A Texture Unit is first calculated by comparing the grey level of a central pixel with the grey level of its neighbours. The comparison records whether each neighbour is greater or lower than the central pixel. Two advantages of LBP are that there is no need to quantise the images and that there is a certain immunity to low frequency artefacts. In a more recent paper, Ojala [53] presented another variation of the LBP by considering the sign of the difference of the grey-level differences histograms. Under the new consideration, LBP is a particular case of the new operator called *p*<sub>8</sub>. This operator is considered as a probability distribution of grey levels: where *p*(*g*<sub>0</sub>, *g*<sub>1</sub>) denotes the co-occurrence probabilities, the authors use *p*(*g*<sub>0</sub>, *g*<sub>1</sub> − *g*<sub>0</sub>) as a joint distribution.
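A basic 3 × 3 LBP code can be sketched in Python as follows. The neighbour ordering, the bit weights and the use of "greater than or equal" as the threshold are assumptions made here for illustration; published variants differ in these details, and the function name is hypothetical.

```python
import numpy as np

def lbp_3x3(image):
    """Return an 8-bit LBP code for every interior pixel of a 2D image."""
    img = np.asarray(image, dtype=np.int64)
    h, w = img.shape
    centre = img[1:-1, 1:-1]
    # Clockwise neighbour offsets starting at the top-left corner (assumed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(centre)
    for bit, (dr, dc) in enumerate(offsets):
        neighbour = img[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc]
        # Only the sign of the difference is kept, not its magnitude.
        codes += (neighbour >= centre).astype(np.int64) << bit
    return codes
```

Because only the sign of each difference is used, adding a constant to the whole image leaves the codes unchanged, which illustrates the immunity to slowly varying intensity mentioned above. A histogram of the codes over a region would then typically serve as the texture feature.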

Filtering, in the context of image processing, consists of a process that will modify the pixel values. There are spatial filters, which are applied directly to the values of the images (e.g., average neighbouring pixels to blur an image) and filters which are applied after a transformation of the data has been performed. Thus a filter in the frequency or Fourier domain will be applied after the image has been converted through the Fourier transform. The filters in the Fourier domain are sometimes named after the frequencies that are to be allowed to pass through them: low pass, band pass and high pass filters. Since textures can vary in their spectral distribution in the frequency domain, a set of sub-band filters can help in their discrimination. One common frequency filtering approach is that of Gabor multichannel filter banks [2,10,62–64].
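Filtering in the Fourier domain can be illustrated with a short Python sketch of an ideal (hard-edged) radial band-pass filter. The function name and the use of a binary mask are assumptions for illustration only; Gabor filter banks use smooth, oriented Gaussian-shaped pass bands instead.

```python
import numpy as np

def bandpass(image, low, high):
    """Keep spatial frequencies with low <= radius < high (cycles per pixel)."""
    img = np.asarray(image, dtype=float)
    spectrum = np.fft.fft2(img)                 # transform to the Fourier domain
    fy = np.fft.fftfreq(img.shape[0])[:, None]  # vertical frequencies
    fx = np.fft.fftfreq(img.shape[1])[None, :]  # horizontal frequencies
    radius = np.hypot(fy, fx)
    mask = (radius >= low) & (radius < high)    # ideal pass band
    return np.real(np.fft.ifft2(spectrum * mask))
```

Setting `low=0` gives a low pass filter and a large `high` gives a high pass filter. Applying several such bands to the same image produces one feature image per band, which is the basic idea behind sub-band filtering for texture discrimination.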

The partitioning of the Fourier space can be achieved in different ways, Gabor being only one. A multiresolution approach, based on finite prolate spheroidal sequences, is described in [8]. The Fourier space is divided into frequencies and orientations, which are further subdivided in a multiresolution approach. Each filter then produces a feature; different textures are captured by different filters. In addition, a feature selection strategy can improve the texture segmentation.

#### *2.4. U-Net Configuration*

The basic U-Net architecture was formed with the following layers: *Input, Convolutional, ReLu, Max Pooling, Transposed Convolutional, Convolutional, Softmax* and *Pixel Classification*. Two levels of depth were investigated by repeating the downsampling and upsampling blocks in the following configurations:

**15 layers:** Input, Convolutional, ReLu, Max Pooling, Convolutional, ReLu, Max Pooling, Convolutional, ReLu, Transposed Convolutional, Convolutional, Transposed Convolutional, Convolutional, Softmax, Pixel Classification.

**20 layers:** Input, Convolutional, ReLu, Max Pooling, Convolutional, ReLu, Max Pooling, Convolutional, ReLu, Max Pooling, Convolutional, ReLu, Transposed Convolutional, Convolutional, Transposed Convolutional, Convolutional, Transposed Convolutional, Convolutional, Softmax, Pixel Classification.

The image input layer was configured for the 32 × 32 patches. The convolutional layers consisted of 64 filters of size 3 and padding of 1. The pooling size was 2 with stride of 2. The transposed convolutional layers had a filter size of 4, stride of 2 and cropping of 1. The numbers of epochs evaluated were 10, 20, 50 and 100. The following optimisation algorithms were analysed: stochastic gradient descent with momentum (sgdm), Adam (Adam) [65] and Root Mean Square Propagation (RMSprop). One last investigation was performed by training the 20-layer network two separate times to investigate the variability of the process.
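The layer parameters above can be checked with a small Python sketch that traces the spatial size of a 32 × 32 patch through both configurations. It uses the standard output-size formulas for convolution, pooling and transposed convolution; the helper names are hypothetical, and the Input, Softmax and Pixel Classification layers are omitted from the lists because they do not change the spatial size.

```python
def conv_size(n, kernel=3, padding=1, stride=1):
    # Standard convolution output size; 3 x 3 with padding 1 preserves n.
    return (n + 2 * padding - kernel) // stride + 1

def pool_size(n, kernel=2, stride=2):
    # 2 x 2 max pooling with stride 2 halves n.
    return (n - kernel) // stride + 1

def tconv_size(n, kernel=4, stride=2, cropping=1):
    # Transposed convolution; filter 4, stride 2, cropping 1 doubles n.
    return (n - 1) * stride + kernel - 2 * cropping

STEP = {"conv": conv_size, "pool": pool_size,
        "tconv": tconv_size, "relu": lambda n: n}

DOWN = ["conv", "relu", "pool"]   # downsampling block
UP = ["tconv", "conv"]            # upsampling block
NET_15 = DOWN * 2 + ["conv", "relu"] + UP * 2
NET_20 = DOWN * 3 + ["conv", "relu"] + UP * 3

def trace(n, layers):
    """Return the spatial size after the input and after every layer."""
    sizes = [n]
    for name in layers:
        n = STEP[name](n)
        sizes.append(n)
    return sizes
```

Under these formulas the 15-layer network reduces the patch to 8 × 8 at its narrowest point and the 20-layer network to 4 × 4, and both return a 32 × 32 output, so that every input pixel receives a class.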

#### *2.5. Misclassification*

For the purposes of assessing the algorithms, a pixel-based assessment will be considered. Each pixel whose class is correctly determined by the segmentation algorithm will be counted as **Correct**, and every pixel to which the algorithm assigns a different class will be considered as **Incorrect**. Notice that since there is no foreground/background distinction but rather correct or incorrect, both **True Positive** (TP) and **True Negative** (TN) are included as correct, and **False Positive** (FP) and **False Negative** (FN) are included in the incorrect. Thus, the **misclassification** in percentage, or classification error, will be calculated as the number of incorrect pixels divided by the total number of pixels of the image: *m* = 100 × (*FP* + *FN*)/(*TP* + *TN* + *FP* + *FN*). The accuracy can be calculated as the complement *a* = 100 × (*TP* + *TN*)/(*TP* + *TN* + *FP* + *FN*).
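The pixel-based measure can be written as a short Python sketch; the function names are hypothetical, and both inputs are assumed to be label images of the same shape.

```python
import numpy as np

def misclassification(predicted, truth):
    """Percentage of pixels whose predicted class differs from the ground truth."""
    predicted = np.asarray(predicted)
    truth = np.asarray(truth)
    incorrect = np.count_nonzero(predicted != truth)
    return 100.0 * incorrect / truth.size

def accuracy(predicted, truth):
    """Complement of the misclassification, in percent."""
    return 100.0 - misclassification(predicted, truth)
```

As a sanity check, and assuming for illustration that all regions have equal area, assigning a single class to every pixel gives a misclassification of 80% for five classes and 93.75% for sixteen classes.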

#### **3. Results**

For each image, the networks were trained with the 3 different optimisation algorithms, 3 layer configurations and 4 epoch numbers, for a total of 36 different combinations. Thus for the 6 composite images there were 216 results. The misclassification of each segmentation was measured against the ground truth as the percentage of pixels classified incorrectly. These results are summarised in Table 1.
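The number of experiments can be reproduced by enumerating the configuration grid, as in the Python sketch below. The three layer configurations are interpreted here as the 15-layer network plus two separate trainings of the 20-layer network; this reading follows the description of the configurations but is an assumption, as are the label strings.

```python
from itertools import product

optimisers = ["sgdm", "Adam", "RMSprop"]
networks = ["15 layers", "20 layers (run 1)", "20 layers (run 2)"]  # assumed reading
epochs = [10, 20, 50, 100]
images = ["a", "b", "c", "d", "e", "f"]

# One entry per (optimiser, network, epochs) combination for a single image.
per_image = list(product(optimisers, networks, epochs))

# One entry per trained network over the six composite images.
all_runs = [(image,) + config for image, config in product(images, per_image)]
```

This gives 3 × 3 × 4 = 36 combinations per image and 36 × 6 = 216 results in total.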

The best results for each image were selected and compared against traditional methodologies and are shown in Table 2. The results are illustrated graphically in two ways. Figure 6 shows the segmented classes overlaid as different colours over the original textured images. Figure 7 shows correctly segmented pixels in white and the misclassified pixels in black.


**Table 1.** Comparative misclassification (%) results of the different U-Net configurations. (Bold and underline denotes the best result for each image).

**Table 2.** Comparative misclassification (%) results with co-occurrence [5], best filtering result from Randen [54], *p*<sub>8</sub> and LBP [53], Watershed [51], Multiresolution sub-band filtering (MSBF) [8] and U-Net [46]. (Bold is the best for each image).


**Figure 6.** (**a**–**f**) Results of the segmentation with U-Nets for the six texture arrangements. The misclassification (%) is shown in each case. The classes are shown as overlaid colours.

**Figure 7.** (**a**–**f**) Results of the segmentation with U-Nets for the six texture arrangements. The misclassification (%) is shown in each case. Pixels that are correctly classified appear in white.

#### **4. Discussion**

The U-Net algorithm provided interesting results, both in terms of the actual misclassification against traditional algorithms and in terms of the variability across the U-Net cases. The segmentation results provided by the U-Nets were better in four of the six images. In some cases, the results were very close to the second best option (a: 2.8/2.6, d: 7.3/7.1) and in two cases (e,f) traditional algorithms provided better results (e: 4.3/7.7, f: 17.0/17.5). The average over all six composites was best for U-Nets; however, the fact that the difference with the second best is relatively small (0.75), and that traditional algorithms provided better results in one third of the cases, shows that care should be taken when selecting algorithms. This is similar to the conclusion of Randen who stated that "No single approach did perform best or very close to the best for all images" [55].

In terms of the U-Net configuration there are several interesting observations. First, there was a great variability in the results produced by the different U-Net configurations. It was surprising that the maximum value of the misclassification in some cases was extremely high, 80% in the cases of 5 textures and 94% in the case of 16 textures; those cases are equivalent to selecting a single class for all textures. Second, three of the best results were obtained with 100 epochs, two with 10 epochs and one with 50, which is counter-intuitive as it would be expected that longer training times would provide better results. Third, three of the best results were provided by RMSprop optimisation, two by Adam and one by sgdm. Fourth, and perhaps most surprising, the results provided by the two 20-layer configurations were very different. In a few cases the results were equal (e.g., image c, sgdm, 10 epochs; image b, Adam, 10 epochs) but in others the variation was huge (e.g., image b, Adam, 50 epochs).

In terms of texture, it can be highlighted that not all textures are the same: the five textures of image (a) are far easier to distinguish and correctly segment than those of image (b) and image (f). The U-Net was capable of segmenting these textures with accuracy comparable to or better than traditional techniques. As mentioned previously, the fact that the textures have been histogram equalised removes the discrimination of the regions by their average intensities. More complex architectures, e.g., Siamese Networks [57], could provide better results, but it is important to use a standard benchmark such as that provided by Randen [55].

There are many other configuration parameters that could be varied, such as *learning rate, batch size, variations of the training data* and *different number of layers*, but for the purpose of this work, the results show first, the capability of deep learning architectures for segmentation of textured images and second, in some cases better results than traditional methodologies. However, the configuration of the network is not trivial and variations of some parameters can provide sub-optimal results. The experiments conducted in this work did not provide conclusive evidence for the selection of any of the parameters evaluated. Furthermore, training of the networks requires considerable resources. Training took around 5 hours for the images with 5 textures and around 96 hours for the image with 16 textures on an Apple (Cupertino, CA, USA) Mac Pro (Late 2013) with a 3.7 GHz Quad-Core and 32 GB Memory with Dual AMD FirePro D300 graphics processors.

Therefore, it can be concluded that U-Net convolutional neural networks can be used for texture segmentation and provide results that are comparable to or better than traditional texture algorithms. Furthermore, these results encourage the application of deep learning to other areas. If we assume that different textures are characterised by patterns, i.e., repetitions of certain sequences or particular variation of intensities, then any data which is characterised by patterns could be analysed. For instance, phonemes in human speech have different patterns, which when combined form words. Thus one line of an image with different textures would have similar characteristics to the intensity variation of a phrase with different phonemes. Moreover, voice signals, which are one-dimensional, can be converted into two-dimensional spectrograms [66] with time on one axis and frequency on the other. In these cases, the spectrograms can be analysed for texture directly.

**Author Contributions:** Conceptualization, C.K., J.V., N.R.M. and C.C.R.-A.; methodology, C.K. and C.C.R.-A.; writing, reviewing and editing, C.K., J.V., N.R.M. and C.C.R.-A.; funding acquisition, J.V. and C.C.R.-A.

**Funding:** This work was funded by the Leverhulme Trust, Research Project Grant RPG-2017-054. C.K. is partially funded by the School of Mathematics, Computer Science and Engineering at City, University of London.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Review*

## **Recent Advances in Saliency Estimation for Omnidirectional Images, Image Groups, and Video Sequences**

### **Marco Buzzelli**

Department of Informatics, Systems and Communication, University of Milano-Bicocca, Viale Sarca 336, 20126 Milan, Italy; marco.buzzelli@unimib.it

Received: 9 July 2020; Accepted: 24 July 2020; Published: 27 July 2020

**Abstract:** We present a review of methods for automatic estimation of visual saliency: the perceptual property that makes specific elements in a scene stand out and grab the attention of the viewer. We focus on domains that are especially recent and relevant, as they make saliency estimation particularly useful and/or effective: omnidirectional images, image groups for co-saliency, and video sequences. For each domain, we perform a selection of recent methods, we highlight their commonalities and differences, and describe their unique approaches. We also report and analyze the datasets involved in the development of such methods, in order to reveal additional peculiarities of each domain, such as the representation used for the ground truth saliency information (scanpaths, saliency maps, or salient object regions). We define domain-specific evaluation measures, and provide quantitative comparisons on the basis of common datasets and evaluation criteria, highlighting the different impact of existing approaches on each domain. We conclude by synthesizing the emerging directions for research in the specialized literature, which include novel representations for omnidirectional images, inter- and intra- image saliency decomposition for co-saliency, and saliency shift for video saliency estimation.

**Keywords:** co-saliency; omnidirectional images; video saliency; visual saliency estimation

#### **1. Introduction**

Visual saliency is defined as a property of a scene in relation to an observer. This follows from a commonly-accepted interpretation [1–3] that defines it as the set of subjective and perceptual attributes that make certain items stand out from their surroundings, and therefore grab the viewer's attention.

In the vision system of human beings and other animals, two components typically contribute to the overall saliency: bottom-up and top-down factors [4]. Bottom-up saliency is driven by low-level activations in the vision system, based for example on pre-attentive computational mechanisms in the primary visual cortex [5], and does not depend on specific tasks and objectives. Conversely, top-down saliency is defined as being goal-directed [6], and as such it is highly dependent on the intrinsic biases of the observer, and correlated to the semantics of the depicted elements. Scientific literature reviews for automatic visual saliency estimation often adopt these two categories to classify existing methods [2]. For example, deep learning solutions are rightfully labeled as top-down approaches due to their intrinsic ability to extract and exploit semantic pieces of information [7], whereas hand-crafted methods tend to rely on lower-level features such as contrasting patterns, and are therefore categorized as bottom-up solutions. In practice, though, multiple interacting factors (both top-down and bottom-up) are considered to determine which parts of the scenes are further processed by the attentional process of the biological vision system [8].

Properly modeling visual saliency means emulating the widest set of factors that influence the evaluation of saliency as performed by a human being. This goal has been pursued by many authors, both in the neuroscience community and, more recently, in the computer vision and image processing communities. Due to the different levels of involved complexity, bottom-up saliency estimation methods are generally faster than top-down methods [9], and thus useful in applications where real-time feedback is considered more important than reaching higher accuracy. For example, in a live augmented reality scenario, fast saliency estimation would locate image regions deemed important for further localized computer vision analysis, and would provide precious information to avoid covering areas of potential interest with the rendering of augmentation elements. Conversely, top-down saliency estimation methods tend to be more robust, at the cost of a higher demand for computational resources. These are therefore typically employed in applications with looser time constraints, and which benefit from semantic interpretation. For example, a system for storing and organizing personal photos could exploit saliency estimation to detect objects of interest based on visual composition, and their reoccurring presence in multiple photos. In general, visual saliency estimation has been successfully employed in multiple tasks, such as image retargeting [10], video summarization [11], and photo-collage creation [12]. It has also been adopted as an intermediate pre-processing step for other computer-vision problems, such as scene recognition [13], object detection [14] and segmentation [15]. Since the advent and diffusion of deep learning, many of these problems have been reformulated in an end-to-end fashion that does not rely on explicitly estimating the salient component, as proven by state of the art solutions in each field [16–18]. There exist, however, problems that remain directly related to the evaluation of saliency information such as advertisement assessment [19], and domains where its explicit computation is particularly relevant, for example in reducing the computational effort for analysis of large quantities of data (such as video sequences, or high-resolution panoramic images exploited in the virtual reality domain).

By analyzing the recent scientific literature on saliency estimation, in fact, specific topics emerged as persistently reoccurring amidst works dedicated to saliency on regular images, due to a combination of the excellent results already reached by the scientific community, and the paradigm shift in solving certain problems without explicitly modeling general-purpose saliency. Such trending topics are, namely, saliency in omnidirectional images, and multiple-input scenarios, which include co-saliency and video saliency estimation. Although visual saliency has been studied in other fields as well (such as light field and hyper-spectral imaging) most of the current domain-specific research happens to converge on the three mentioned topics, while the literature does not offer enough material to produce a valuable review of recent solutions related to other less widespread domains. Our goal is therefore to highlight the recent trends of research in these fields, providing a concise yet exhaustive insight into each analyzed method, and summarizing the similarities and differences across different solutions. The investigated domains are either very recent, or have lived a particularly dynamic evolution. As a consequence, different methods are typically evaluated and/or optimized on different datasets, making comparative evaluations extremely challenging. Nonetheless, we conduct an analysis on the joint occurrence of methods and datasets, and we benchmark solutions that are directly comparable as they were evaluated in equivalent conditions.

Accompanying the development of research into visual saliency estimation through the years, the scientific literature has periodically offered different benchmarks and surveys, typically concerning general-purpose saliency. Borji et al. (2012) [20] provide an in-depth comparison of 35 state of the art methods for saliency estimation, over both synthetic and natural images. A second work by Borji et al. (2015) [9] conducts a similar benchmark including newly developed solutions, the most recent of which, however, was released in the year 2014. A more recent review is presented by Wang et al. (2019) [21], offering an in-depth survey over methods for salient object detection specifically based on deep learning approaches. Concerning domain-specific analyses, Cong et al. (2019-I) [22] cover methods for saliency detection that rely on so-called "comprehensive information", such as depth cues, inter-image correspondence (equivalent to co-saliency), and multiple frames. Zhang et al. (2018) [23] review the concepts, applications, and challenges intrinsic to co-saliency detection, whereas Riche et al. (2016) [24] focus on video saliency estimation approaches based on a bottom-up interpretation. With the current survey, our goal is to inform on up-to-date developments in the fields of domain-specific visual saliency estimation. To the best of our knowledge, there are currently no surveys that specifically focus on saliency estimation in omnidirectional images, which is the most recent domain-specific development in the field.

The main contributions of this paper are the following:


The rest of the paper is structured as follows: Section 2 presents the systematic approach that led to the selection of works in this review. Section 3 introduces the three domains of interest and their peculiarities, followed by the description of different interpretations and representations commonly adopted for visual saliency, and an overview of existing metrics and measures used to assess saliency estimation algorithms. The subsequent sections present methods, datasets, and measures for each domain of interest: Section 4 focuses on omnidirectional images, Section 5 relates to co-saliency estimation, and finally, Section 6 presents developments in the field of video saliency.

## **2. Methodology for Literature Review**

The selection of literature works included in this review paper has been determined through a systematic approach, which is described in the following.

The initial prompt was to observe and highlight the current trends in visual saliency estimation. With this objective, we performed a keyword-based search on the academic search engine Google Scholar, using the terms: "visual saliency", "saliency estimation", "salient object detection". Given the time-sensitive nature of our goal, we restricted the results to works published no earlier than 2017. For each resulting paper, we retrieved the following information:


For the years 2018 and 2017, we restricted the number of results to those having collected at least one citation at the time of the review, intending to focus on the dissemination of works that are considered relevant by the scientific community. Based on the title and abstract analysis, then, we excluded some further results:

• Works that do not fit in the field of visual saliency (retrieved due to the unreliability of keyword-based search alone);

• Works that focus on extremely narrow tasks (e.g., saliency estimation for skin lesions, or for comic strips).

Works related to datasets and surveys have also been isolated and used as a reference for the corresponding sections. We annotated the remaining results in terms of domains of application, and the most recurring themes emerged as being: saliency estimation for omnidirectional images, co-saliency estimation, and saliency estimation for video sequences. We therefore focused on these domains to provide the scientific community with an analysis of relevant and recent developments. For each of the selected works, domain-specific and cross-domain characteristics have been collected through careful study of the corresponding manuscripts.

The final selection of recent and relevant methods for saliency estimation has then been used as the starting point to identify the associated evaluation measures and the associated datasets. Evaluation measures have been classified as either general-purpose (presented in Section 3.3), or domain-specific (presented in the corresponding Sections 4–6).

By virtue of the importance of data in training and assessing methods for visual saliency estimation, we dedicated an in-depth analysis to the corresponding datasets for each domain. A matrix describing the joint occurrences of datasets and methods has been defined and presented for all three domains. At this stage, no explicit constraint on the release date has been imposed: the rationale is that if a dataset is still widely adopted as a benchmark for new methods, it is to be considered relevant and worth mentioning. Multiple instances of the same dataset being reported with different names have been identified and merged. Conversely, whenever two or more saliency estimation methods refer to the same dataset in different versions, this piece of information has been annotated and reported. Finally, all datasets that were identified during the preliminary, keyword-based, search have been found to be already present in the current selection. Detailed characteristics of the identified datasets have been presented.

#### **3. Visual Saliency Estimation**

In this section, we describe the recently emerged domains for visual saliency, and provide background information about the different types of saliency representation, as well as commonly used evaluation measures.

#### *3.1. Domains*

The scientific literature on saliency estimation has witnessed the emergence of domain-specific solutions, covering a wide range of topics that go beyond the traditional regular-image input. Specifically, recent developments have shown several works in the domains of omnidirectional images, image groups for co-saliency estimation, and video sequences, as exemplified in Figure 1. Other fields of application, such as light field [25] and hyper-spectral imaging [26,27], have also caught the attention of saliency-related research. Visual saliency estimation is, however, still in its early stages in such domains, and a full review of related methods is therefore left as a future development.

**Figure 1.** Visualization of the three domains of interest for visual saliency estimation: omnidirectional images (**a**), image groups for co-saliency (**b**), and clips for video saliency (**c**).

Another relevant domain of research is depth-assisted visual saliency estimation, which explores the advantages of predicting saliency on so-called RGB-D images [28]. In this case, the additional knowledge associated with the distance between the camera and the depicted elements can improve the separation of subjects from the background, providing a precious piece of information for better saliency estimation. Despite the clear relevance of the topic, we chose not to explicitly discuss this domain since such a wide field deserves a whole dedicated survey paper. Nonetheless, we found that depth-assisted saliency estimation is sometimes included in the analyzed domains of omnidirectional images [29], co-saliency [30], and video saliency [31]. We will, therefore, reference and discuss only these works in the corresponding sections.

**Omnidirectional** images (ODIs) are panoramic representations of a scene, covering a 360° solid angle from a single viewpoint, typically employed in passive virtual reality. Virtual reality is the experience of a simulated world, which can be navigated by the user to varying degrees of freedom [32]. In a passive virtual reality scenario, the spatial movements are predefined, and the virtual environment is precomputed in a sequence of omnidirectional images. During viewing, the user only determines the direction of viewing, i.e., the subpart of the ODI to visualize at any given time. Image cropping for thumbnail selection is particularly valuable when operating on large omnidirectional images, depicting wide sceneries in high resolution [33]. Storing and transmitting these ODIs can then benefit from perceptually-aware compression, i.e., reducing the represented detail over areas that are considered "less-interesting" [34].

Image **co-saliency** refers to the problem of estimating the saliency from a group of images that depict the same subject. The rationale behind this approach is to provide the saliency estimation model with additional information, and thus partially compensate for the ill-posed nature of the problem. Depending on the chosen level of abstraction, image groups for co-saliency estimation could either represent exactly the same instance from multiple points of view, or different instances of the same category, possibly characterized by slight variations in appearance.

**Video** saliency is the task of performing saliency estimation on a sequence of frames. By considering the time component, in fact, estimation of visual saliency acquires additional value in terms of understanding how people react to, and learn from images [35]. If we exclude the naive frame-by-frame approach, multiple-frame analysis helps a given model gain a global view of the input, in a fashion similar to what happens with co-saliency estimation. In addition, the annotated sequences are expected to exhibit different patterns compared to single image saliency, as the vision of each single video frame is both limited in time and highly influenced by the previous frames.

Cross-talk between domains is of course highly present, with approaches aiming at video saliency estimation in omnidirectional images [36], as well as co-saliency estimation in video sequences [37].

#### *3.2. Saliency Representation*

Ground truth for visual saliency estimation is typically collected and distributed in one of three possible representations: scanpaths, saliency maps, and salient object regions. These are visually shown in Figure 2.

**Figure 2.** Different representations used for visual saliency ground truth: fixation points relative to scanpaths (**a**), continuous saliency maps directly related to fixations (**b**), and sharp salient object regions (**c**).

#### 3.2.1. Fixations and Scanpaths

Human eyes have been shown to explore a given scene in saccades, which are rapid movements from a point of interest to another. Between saccades, a temporary pause, called a fixation, is spent in the area of the point of interest [38]. The ordered sequence of fixations is called scanpath [39], or gaze trajectory [33], and it is the first and most direct way to represent the salient areas of an image. In some cases, such as omnidirectional images explored with virtual reality displays, gaze trajectory is complemented with head trajectory [33,40,41], tracking the movement of the whole head of the viewing subjects.

#### 3.2.2. Fixation-Related Saliency Maps

Scanpaths can be processed by discretizing fixation coordinates to pixel coordinates. The result is a scattered map of pixel saliency, which is typically convolved with a bidimensional Gaussian kernel [41] in order to create a proper saliency map through kernel density estimation [42]. This type of representation removes, by definition, the temporal relationship between fixations, which can be considered unnecessary for specific tasks, such as thumbnail selection [33]. Mixed representations have been proposed, such as saliency volumes [35], giving the possibility to produce both scanpaths and saliency maps.
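As a minimal sketch of this processing, the following Python snippet discretizes fixation coordinates and smooths the resulting scattered map with an isotropic Gaussian kernel. The function name, the (row, column) convention, and the kernel width `sigma` are illustrative assumptions, not values taken from any of the cited datasets.

```python
import math

def fixation_saliency_map(fixations, height, width, sigma=1.0):
    """Turn (row, col) fixations into a sum-normalized saliency map."""
    # Scattered map: count the fixations that fall on each pixel.
    counts = [[0.0] * width for _ in range(height)]
    for y, x in fixations:
        r, c = int(round(y)), int(round(x))
        if 0 <= r < height and 0 <= c < width:
            counts[r][c] += 1.0

    # Direct 2D Gaussian convolution (quadratic cost, fine for a small demo).
    smooth = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if counts[r][c] == 0.0:
                continue
            for rr in range(height):
                for cc in range(width):
                    d2 = (rr - r) ** 2 + (cc - c) ** 2
                    smooth[rr][cc] += counts[r][c] * math.exp(-d2 / (2 * sigma ** 2))

    # Sum-normalize so that the map can be read as a density.
    total = sum(map(sum, smooth))
    if total == 0.0:
        return smooth
    return [[v / total for v in row] for row in smooth]
```

The temporal order of the fixations plays no role in the result, which matches the loss of temporal information noted above.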

#### 3.2.3. Salient Object Regions

Another commonly used representation for saliency information consists of pixel-precise binary segmentation maps. This type of annotation can be generated by pre-segmenting each element of the scene and subsequently selecting one, or more, segments that overlap with the largest amount of fixations [43] or explicit selections [44]. Alternatively, one or multiple users can be asked to directly provide a hand-drawn segmentation of the area they consider most relevant [45]. In the case of multiple proposals, these are then reduced to one annotation with a predefined aggregation strategy.

Different representations are more suited to different applications. For example, temporally-aware scanpaths can be useful to determine the optimal path of a virtual camera in an omnidirectional video [46]. Continuous saliency maps can be employed for saliency-aware image compression, specifically tuning non-uniform bit allocation as a function of the estimated local saliency [47], while a sharp salient object estimation is typically used for automatic or semi-automatic object segmentation in photo-manipulation tools [48]. There is no hard evidence that explicitly optimizing for one representation helps improve performance on others, and methods tend to be developed for clusters of datasets sharing the same type of saliency representation, with a few isolated exceptions [35,49].

#### *3.3. Evaluation Measures*

We provide a selection of the evaluation measures most commonly used by the saliency estimation methods analyzed in this paper. An exhaustive review of evaluation measures for saliency models is provided by Riche et al. [50]. Domain-specific measures will also be presented, when existing, in each of the subsequent sections, regarding omnidirectional images, co-saliency, and video saliency estimation.

In the following, we categorize the selected evaluation measures on the basis of the involved ground truth representation. We will refer to the predicted saliency map as *P*, and to the corresponding ground truth as *G*. Some formulations will rely on the sum-normalized versions *P*′ and *G*′. *X* will be the total number of image pixels, and *F* the total number of fixations.

#### 3.3.1. Measures for Fixations, Scanpaths, and Saliency Maps

The **Pearson Correlation Coefficient** (CC) [51] is a measure of the linear correlation between prediction *P* and ground truth *G* considered as two statistical variables:

$$\text{CC} = \frac{\text{cov}(P, G)}{\sigma_P \sigma_G} \tag{1}$$

where cov(·, ·) is the covariance, and *σ<sub>P</sub>* and *σ<sub>G</sub>* are the standard deviations of, respectively, the predicted saliency data and the ground truth saliency data.
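Equation (1) can be sketched in a few lines of Python. The snippet assumes that both maps have been flattened into lists of equal length; the function name is illustrative.

```python
import math

def pearson_cc(p, g):
    """Pearson correlation between two flattened saliency maps."""
    n = len(p)
    mean_p = sum(p) / n
    mean_g = sum(g) / n
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g)) / n
    std_p = math.sqrt(sum((a - mean_p) ** 2 for a in p) / n)
    std_g = math.sqrt(sum((b - mean_g) ** 2 for b in g) / n)
    if std_p == 0.0 or std_g == 0.0:
        raise ValueError("CC is undefined for a constant map")
    return cov / (std_p * std_g)
```

A value of 1 denotes a perfect linear relationship, and a value of −1 a perfectly inverted one.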

The **Normalized Scanpath Saliency** (NSS) [52] is used to compare a densely-estimated saliency map with a fixation-based ground truth. Specifically, it is the average of the estimated saliency values *P* in the locations indicated by eye fixations *f* :

$$\text{NSS} = \frac{1}{F} \sum_{f=1}^{F} \frac{P(f) - \mu_P}{\sigma_P} \tag{2}$$

Note that the saliency estimation map is normalized to have zero mean and unit standard deviation through the corresponding statistics *µ<sub>P</sub>* and *σ<sub>P</sub>*. In this scenario, an NSS of zero indicates a correspondence between estimation and ground truth equivalent to random chance. Conversely, a strongly positive or strongly negative NSS suggests a high correspondence or anti-correspondence, respectively.
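A possible implementation of Equation (2) is sketched below. It assumes that the prediction is a 2D list and that fixations are given as integer (row, column) pairs; both conventions are illustrative.

```python
import math

def nss(p, fixations):
    """Normalized Scanpath Saliency of prediction p at the fixated pixels."""
    values = [v for row in p for v in row]
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sigma == 0.0:
        raise ValueError("NSS is undefined for a constant prediction")
    # Average the z-scored prediction over all fixation locations.
    return sum((p[r][c] - mu) / sigma for r, c in fixations) / len(fixations)
```

For example, fixating every pixel exactly once yields a score of zero, since the z-scored map has zero mean.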

The **Kullback–Leibler** (KL) divergence [53] between two saliency maps, considered as probability density functions, is computed as:

$$\text{KL} = \sum_{\mathbf{x}=1}^{X} G'(\mathbf{x}) \cdot \log \left( \frac{G'(\mathbf{x})}{P'(\mathbf{x}) + \gamma} + \gamma \right) \tag{3}$$

where *γ* is a protection constant.
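The role of the protection constant is easier to see in code. The following sketch follows Equation (3) on flattened maps; the default value of `gamma` is an assumption, chosen only to be small with respect to the normalized saliency values.

```python
import math

def kl_divergence(p, g, gamma=1e-12):
    """KL divergence of Equation (3) between flattened maps p and g."""
    sum_p, sum_g = sum(p), sum(g)
    total = 0.0
    for pv, gv in zip(p, g):
        p_n = pv / sum_p  # sum-normalized prediction P'
        g_n = gv / sum_g  # sum-normalized ground truth G'
        # gamma prevents both a division by zero and log(0).
        total += g_n * math.log(gamma + g_n / (p_n + gamma))
    return total
```

Identical maps give a divergence close to zero, while a prediction that assigns no saliency to the ground truth locations is heavily penalized.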

The **SIMilarity measure** (SIM) [54], also called histogram intersection, compares two different saliency maps when viewed as normalized distributions:

$$\text{SIM} = \sum_{\mathbf{x}=1}^{X} \min\left(P'(\mathbf{x}), G'(\mathbf{x})\right) \tag{4}$$
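Equation (4) translates directly into the following sketch, which again assumes flattened, non-negative maps with a non-zero sum.

```python
def similarity(p, g):
    """Histogram intersection between two sum-normalized saliency maps."""
    sum_p, sum_g = sum(p), sum(g)
    return sum(min(pv / sum_p, gv / sum_g) for pv, gv in zip(p, g))
```

The result ranges from 0, for maps with no overlap, to 1, for maps that are identical after normalization.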

The **Earth Mover's Distance** (EMD) [55] quantifies the minimal cost to transform probability distribution *P* into *G*:

$$\text{EMD} = \left( \min_{f_{ij}} \sum_{i,j} f_{ij} d_{ij} \right) + \left| \sum_{i} G_{i} - \sum_{j} P_{j} \right| \max_{i,j} d_{ij} \tag{5}$$

with:

$$\sum_{i,j} f_{ij} = \min \left( \sum_{i} G_i, \sum_{j} P_j \right) \tag{6}$$

where *f<sub>ij</sub>* is the amount of mass moved between bin *i* in *G* and bin *j* in *P*, and *d<sub>ij</sub>* represents the ground distance between the same two bins.
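Solving the general minimization requires a linear programming solver, but a special case can be sketched compactly. The snippet below assumes one-dimensional histograms that are sum-normalized, with the ground distance between bins *i* and *j* set to |*i* − *j*|. Under these assumptions the two distributions have equal mass, so the second term of Equation (5) vanishes, and the minimal cost equals the sum of absolute differences between the two cumulative distributions.

```python
def emd_1d(p, g):
    """EMD between 1D histograms, with |i - j| as the ground distance."""
    sum_p, sum_g = sum(p), sum(g)
    carried = 0.0  # mass that still has to be moved past the current bin
    cost = 0.0
    for pv, gv in zip(p, g):
        carried += gv / sum_g - pv / sum_p
        cost += abs(carried)
    return cost
```

Moving all the mass from the first to the third bin of a histogram, for instance, has a cost of 2.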

#### 3.3.2. Measures for Salient Object Regions

The **Precision/Recall curve** is computed by varying a binarization threshold on the continuous saliency estimation map *P*, and computing at each level the precision PR and recall RE components:

$$\text{PR} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{7}$$

$$\text{RE} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{8}$$

where TP is the number of True Positive pixels, FP False Positives, and FN False Negatives, obtained by comparing predicted saliency *P* with ground truth map *G*.

The **F-measure** (*F<sub>β</sub>*) [56] corresponds to the weighted harmonic mean of precision PR and recall RE:

$$F_{\beta} = \frac{\left(1 + \beta^2\right) \text{PR} \cdot \text{RE}}{\beta^2 \text{PR} + \text{RE}} \tag{9}$$

In this case, the continuous-valued saliency estimation can be binarized with different techniques before effectively computing precision and recall. Furthermore, it is common practice [9] to give more weight to the precision component (considered more important than recall for the saliency estimation task), by setting parameter *β*<sup>2</sup> to 0.3.
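Equations (7) to (9) can be sketched as follows for a single binarization level. The fixed threshold is only one of the possible binarization techniques and is an assumption of this example, as is the convention of returning zero when a denominator vanishes.

```python
def precision_recall_f(p, g, threshold=0.5, beta2=0.3):
    """Precision, recall and F-measure of flattened prediction p against binary g."""
    tp = fp = fn = 0
    for pv, gv in zip(p, g):
        predicted = pv >= threshold
        if predicted and gv:
            tp += 1
        elif predicted:
            fp += 1
        elif gv:
            fn += 1
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    denom = beta2 * pr + re
    f_beta = (1 + beta2) * pr * re / denom if denom else 0.0
    return pr, re, f_beta
```

Calling the function over a range of thresholds yields the points of the Precision/Recall curve.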

The **Mean Absolute Error** (MAE) is computed directly on the prediction, without any threshold, as:

$$\text{MAE} = \frac{1}{X} \sum_{\mathbf{x}=1}^{X} |P(\mathbf{x}) - G(\mathbf{x})| \tag{10}$$

The **Structure measure** (*S<sub>α</sub>*) [57], inspired by the structure similarity (SSIM) from image quality assessment, is the weighted mean between region-aware structural similarity *S<sub>r</sub>* and object-aware structural similarity *S<sub>o</sub>*:

$$S_{\alpha} = \alpha \cdot S_{o} + (1 - \alpha) \cdot S_{r} \tag{11}$$

where *S<sub>r</sub>* covers the object-part similarity with the ground truth, while *S<sub>o</sub>* accounts for the global similarity based on sharp estimation contrast and uniform distribution.

The **enhanced-alignment measure** (*Q*) [58] captures both pixel-level matching and image-level statistics as:

$$Q = \frac{1}{X} \sum_{\mathbf{x}=1}^{X} \frac{1}{4} \left(1 + \zeta(\mathbf{x})\right)^2 \tag{12}$$

where:

$$\zeta = \frac{2\varphi_G \circ \varphi_P}{\varphi_G \circ \varphi_G + \varphi_P \circ \varphi_P} \tag{13}$$

Bias matrix *ϕ*<sub>{*G*,*P*}</sub> is the distance between each value of the binary map (*G* or *P*) and its global mean, and the two matrices are compared through the Hadamard product (◦).
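The two steps of Equations (12) and (13) can be sketched as follows on flattened binary maps. The small constant `eps` in the denominator is an assumption added here to avoid a division by zero where both bias values vanish; it is not part of the formulation above.

```python
def enhanced_alignment(p, g, eps=1e-12):
    """Enhanced-alignment measure between flattened binary maps p and g."""
    n = len(p)
    mean_p = sum(p) / n
    mean_g = sum(g) / n
    total = 0.0
    for pv, gv in zip(p, g):
        phi_p = pv - mean_p  # bias matrix entry for the prediction
        phi_g = gv - mean_g  # bias matrix entry for the ground truth
        zeta = 2 * phi_g * phi_p / (phi_g * phi_g + phi_p * phi_p + eps)
        total += 0.25 * (1 + zeta) ** 2
    return total / n
```

A prediction equal to the ground truth scores close to 1, while its complement scores close to 0.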

#### 3.3.3. Representation-Independent Measures

**Area Under Curve (AUC)** is the area under the Receiver Operating Characteristic (ROC) curve. The latter is computed by varying the binarization threshold and plotting the True Positive Rate (TPR) against the False Positive Rate (FPR):

$$\text{TPR} = \text{RE} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{14}$$

$$\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} \tag{15}$$

Variants of the general concept of AUC take into consideration data distribution at various levels, in order to normalize the evaluation of estimated saliency. These include AUC-Judd [54], AUC-Borji [20], AUC-Zhao [59] and AUC-Li [60]. The AUC measure has been used to evaluate saliency estimation under different representations: from fixations, scanpaths, and saliency maps [34,36,61–66] to salient object regions [67–70].
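The general concept can be sketched by sweeping the threshold over the predicted values and integrating the resulting ROC curve with the trapezoidal rule. The snippet does not reproduce any of the named variants, and it assumes that the ground truth contains at least one positive and one negative pixel.

```python
def roc_auc(p, g):
    """Area under the ROC curve of flattened prediction p against binary g."""
    positives = sum(1 for gv in g if gv)
    negatives = len(g) - positives
    points = [(0.0, 0.0)]  # (FPR, TPR) pairs, from strictest threshold down
    for t in sorted(set(p), reverse=True):
        tp = sum(1 for pv, gv in zip(p, g) if pv >= t and gv)
        fp = sum(1 for pv, gv in zip(p, g) if pv >= t and not gv)
        points.append((fp / negatives, tp / positives))
    # Trapezoidal integration of TPR over FPR.
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A prediction that ranks all salient pixels above all non-salient ones obtains an AUC of 1.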

#### **4. Omnidirectional Images**

Omnidirectional images, also known as 360° images or panoramic images, present a set of domain-specific peculiarities. An omnidirectional image is digitally stored in equirectangular format, projecting a spherical surface onto a planar, rectangular one. Any such projection inevitably introduces distortions in the representation, as a direct consequence of the *Theorema Egregium* [71,72]; saliency estimation methods for regular images would therefore behave sub-optimally without a proper adaptation. For this reason, several methods specifically aimed at omnidirectional images focus on producing an alternative projection or transformation that fully exploits existing approaches for classical image saliency estimation [35,36,73].
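The distortion of the equirectangular format can be made concrete with a short sketch that maps pixel centers to spherical coordinates. The axis conventions (longitude increasing with the column index, latitude decreasing with the row index) and the function names are assumptions of this example.

```python
import math

def pixel_to_sphere(row, col, height, width):
    """Map an equirectangular pixel center to (longitude, latitude) in radians."""
    lon = ((col + 0.5) / width - 0.5) * 2.0 * math.pi  # in [-pi, pi)
    lat = (0.5 - (row + 0.5) / height) * math.pi       # in (-pi/2, pi/2)
    return lon, lat

def row_area_weight(row, height):
    """Relative area on the sphere covered by a pixel of the given row."""
    _, lat = pixel_to_sphere(row, 0, height, 1)
    return math.cos(lat)
```

Every row holds the same number of pixels, yet rows near the poles cover a much smaller portion of the sphere than rows near the equator, which is why content close to the poles appears stretched.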

When users explore an omnidirectional image, they normally use a head-mounted display to freely navigate the scene. In this case, only a portion of it is shown at any given time, under a so-called Normal Field of View (NFoV), which introduces less-noticeable distortions. The starting point of view, a non-ODI thumbnail, and a suggested exploration pattern can all be optimized for the best user experience by exploiting saliency estimation.

Saliency maps related to several omnidirectional datasets have been observed to exhibit a bias in fixations close to the equator line of view [33,34,41]. This bias has been exploited by different methods [62,63] to produce more accurate estimations.
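One simple way to exploit such a bias, shown here only as a hypothetical sketch and not as the mechanism of the cited methods, is to weight each row of an equirectangular saliency map with a Gaussian function of its latitude. The function name and the width `sigma_deg` are arbitrary assumptions.

```python
import math

def apply_equator_prior(saliency, sigma_deg=20.0):
    """Reweight an equirectangular saliency map towards the equator."""
    height = len(saliency)
    weighted = []
    for r, row in enumerate(saliency):
        # Latitude of the row center, in degrees (+90 at the top of the image).
        lat_deg = 90.0 - (r + 0.5) * 180.0 / height
        w = math.exp(-lat_deg ** 2 / (2.0 * sigma_deg ** 2))
        weighted.append([v * w for v in row])
    total = sum(map(sum, weighted))
    if total == 0.0:
        return weighted
    return [[v / total for v in row] for row in weighted]
```

Applied to a uniform map, the prior alone produces a horizontal band of saliency centered on the equator.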

#### *4.1. Methods for Omnidirectional Images*

Table 1 presents a synthetic overview of recent methods for saliency estimation in omnidirectional images. All analyzed methods target a ground truth in the form of fixation saliency maps. They all produce a continuous saliency map output related to fixation data, whereas only a limited subset also explicitly predicts scanpath trajectories [34,35].

**Table 1.** Characteristics of recent methods for visual saliency estimation in omnidirectional images. The "Target" column indicates the nature of the ground truth used to train or develop the methods, while "Output" describes what data they are explicitly able to generate (FM = Fixation maps, SP = Scanpaths).


Part of the research in this field consists of evaluating the transferability of existing methods originally designed for classical images [33,34,61,63,73]. Sitzmann et al. [33] initially collected the SVR (Saliency in Virtual Reality) dataset. Through observations on the extensive and diverse set of acquisitions, they acquired knowledge about fixation bias, which they used to improve upon existing saliency estimation solutions when applied in the field of omnidirectional images. They applied the developed method to a wide range of use cases, including automatic montage and summarization of videos, thumbnail extraction, and video compression. De Abreu et al. [34] gathered data only relative to the whole head movement, instead of tracking the viewers' eyes, when collecting their own dataset. The authors first proposed a method to convert this information into saliency maps. They then observed a fixation bias as well, which is addressed using the proposed Fused Saliency Maps (FSM) method, operating on existing saliency estimation solutions. Monroy et al. [61] presented an architectural extension that can be applied to any existing neural network for saliency estimation, in order to fine-tune it to the specific domain of omnidirectional images. The underlying idea is the extraction of six undistorted patches of the panoramic view, their independent evaluation, and subsequent fusion.

As previously mentioned, some methods devised specific representations of the input data that allow for full exploitation of the domain without suffering from its intrinsic disadvantages (namely, image distortions) [36,39,73]. Assens et al. [35] proposed a novel representation, called saliency volume, to extract saliency information that can be adapted to different forms: from time-independent saliency maps, to ordered scanpaths (extracted through specific sampling strategies), to a hybrid representation consisting of temporally weighted saliency maps. Cheng et al. [36], who also collected the Wild-360 dataset, presented a weakly-supervised training scheme for a spatial-temporal neural network architecture. They also proposed working on a six-face cube projection, in order to mitigate the heavy distortions of equirectangular projection, and implemented so-called cube padding to hide the discontinuities of the representation from the neural network. Maugey et al. [73] proposed an aggregation technique for the application of existing saliency estimation methods to different map projections. They mitigated the discontinuities introduced at the edges of 2D representations by performing a double cube projection, the results of which are eventually merged. They also proposed the automation of a navigation pattern that maximizes exposure to estimated salient areas.

Despite the clear dominance of machine learning approaches to saliency estimation (and, specifically, deep learning approaches), a good deal of recent methods for omnidirectional saliency are based on hand-crafted design and combination of visual features [29,33,34,62–64,73]. Ling et al. [62] defined a hand-crafted approach to saliency estimation for omnidirectional images. Their color-dictionary sparse representation (CDSR) is applied in conjunction with multi-patch analysis to simulate human color perception. Fixation bias is also taken into consideration for the specific characteristics of the domain. Lebreton et al. [63] extended existing solutions to the estimation of saliency in omnidirectional images, namely Boolean Map Saliency (BMS) and Graph-Based Visual Saliency (GBVS). They then defined a novel framework, called Projected Saliency, to adapt existing estimation models with a simple mechanism, which allowed extensive analysis of features interaction in computational saliency models. Fang et al. [64] developed a hand-crafted solution based on the fusion of feature contrast and boundary connectivity, leaning on the figure-ground law from Gestalt Theory. Boundary connectivity is designed to describe the spatial layout of the image region with an upper and a lower boundary. Feature contrast is based on luminance and color features from the CIE Lab color space. Battisti et al. [29] presented a hand-crafted approach based on low-level image descriptors, such as edges and texture features. They also exploit a depth description of the image itself, to produce a more robust estimation of image saliency, which is evaluated using metrics such as the Kullback–Leibler divergence, and the correlation coefficient.

Methods based on deep learning [35,36,61] are built on the basis of existing Convolutional Neural Network (CNN) backbones, such as the VGG\_CNN\_M [74] and VGG-16 [75] architectures (from the Visual Geometry Group), and the residual-learning-based ResNet-50 [76] architecture.

#### *4.2. Datasets for Omnidirectional Images*

Table 2 presents a synthetic overview describing the adoption of different datasets by different methods for visual saliency estimation in omnidirectional images. The most frequently adopted dataset is the one published with the Salient360! challenge [41], in some cases based on an old version of the same set [40]. The iSUN dataset [77] (interactive Scene UNderstanding) was used by Assens et al. [35] to pre-train their solution, but does not involve omnidirectional images. The MIT dataset [78] from Massachusetts Institute of Technology was adopted for evaluation by Maugey et al. [73], but does not contain saliency ground truth information.

A detailed description of all the relevant datasets is consequently presented in Table 3. These can be differentiated first and foremost by the stimuli characteristics, ranging from image resolution (when stored in equirectangular format), to duration of the exposure to the stimulus itself. The display device is typically either a head-mounted display such as Oculus Development Kit 2 (DK2), or a classical computer screen. In the latter case, the image is visualized in Normal Field of View, allowing the user to navigate the whole panorama with the use of mouse and keyboard.



**Table 2.** Dataset/method matrix for visual saliency estimation in omnidirectional images.

**Table 3.** Datasets for visual saliency in omnidirectional images and related characteristics (FM = Fixation maps, SP = Scanpaths).


All analyzed datasets provide a ground truth in terms of either scanpaths (for eyes and head movement) or fixations-related saliency maps, i.e., without precisely-annotated salient object regions, possibly due to the intrinsic difficulty in segmenting equirectangular projection images.

The Salient360! [41] dataset was created for the Grand Challenge "Salient360!" organized in conjunction with ICME 2017 (International Conference on Multimedia and Expo). The dataset has been updated through the years [40], with the last edition also including a set of omnidirectional video clips. It is supplied with a script toolbox for the evaluation of predicted scanpaths and saliency maps.

The SVR [33] (Saliency in Virtual Reality) dataset is a collection of both head and eye orientation data (scanpaths), coming from the observation of 22 stereoscopic omnidirectional images. The environmental conditions of the stimuli include combinations of users being seated or standing, with or without a head-mounted display. In all conditions, an eye tracking device was used.

Wild-360 [36] is an exclusively video-based dataset for omnidirectional saliency. The original clips were retrieved from YouTube using specific keywords such as "nature", "wildlife", and "animals", in order to collect a dataset with heterogeneous and dynamic contents. The video sequences were manually annotated by multiple users, without any head- or eye-tracking device.

LAY [34] (Look Around You) was built with the objective of developing saliency estimation methods without the support of an eye tracking device for data collection. Specifically, the head orientation of the viewers (called Viewport Center Trajectory) is used as a proxy ground truth for the generation of saliency maps. Different experiments have been conducted by varying the viewing time of each stimulus.

#### *4.3. Evaluation of Saliency for Omnidirectional Images*

Methods for saliency estimation in omnidirectional images are evaluated with a variety of measures, most of which are common to visual saliency in traditional images, such as the Pearson correlation coefficient CC ([29,33,36,61–64]), and the area under the ROC curve AUC ([34,36,61–64]).

The Salient360! benchmark [41] introduced, among other criteria, an evaluation based on the Kullback–Leibler divergence (KL). Although not specifically designed for omnidirectional images, this has been widely adopted as an evaluation measure [29,61–64] thanks to Salient360! being the de-facto reference for saliency in omnidirectional images.

Regarding domain-specific evaluation, the same benchmark also introduced the evaluation of scanpaths based on the comparison metric by Jarodzka et al. [83], properly adjusted to incorporate orthodromic distances in 360° instead of Euclidean distances. The original metric is based on a comparison between each fixation from the prediction with all the fixations from the provided ground truth. Such comparison is applied on the basis of multiple elements, namely the spatial proximity of starting points, the difference in direction and magnitude of the saccades, and the temporal proximity of saccade midpoints.
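The orthodromic (great-circle) distance between two fixations can be computed with the haversine formula, as in the following sketch. It assumes that fixations are expressed as (longitude, latitude) pairs in radians on a unit sphere; this is an illustration of the distance alone, not of the full scanpath comparison metric.

```python
import math

def orthodromic_distance(lon1, lat1, lon2, lat2):
    """Great-circle distance, in radians, between two points on the unit sphere."""
    d_lat = lat2 - lat1
    d_lon = lon2 - lon1
    a = (math.sin(d_lat / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(d_lon / 2.0) ** 2)
    # Clamp against rounding errors before taking the arcsine.
    return 2.0 * math.asin(min(1.0, math.sqrt(a)))
```

Two fixations at longitudes of −179° and 179° on the equator lie at opposite borders of the equirectangular image, yet their orthodromic distance is only 2°, which a Euclidean distance in image space would fail to capture.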

Based on the dataset/method matrix in Table 2, the Salient360! dataset is the best candidate benchmark to compare the largest subset of selected methods. Results are presented in Table 4 according to four different metrics reported in the corresponding publications. The VGG-based model by Assens et al. [35] has been excluded as it does not report performance on metrics comparable with other methods. The method by Ling et al. [62] achieves the best results in absolute terms across all considered measures, while the second-best is the model by Fang et al. [64], according to three measures out of four. Both solutions are based on hand-crafted algorithms with a specific focus on emulating the color perception in human vision. Omnidirectional images, therefore, would appear to represent a domain where manually-defined criteria still outperform machine learning, possibly due to the stronger positional bias, and to image distortions that are uncommon in large datasets used for neural network pre-training [84].


**Table 4.** Quantitative comparison of selected methods for saliency estimation in omnidirectional images, on the Salient360! challenge dataset. Best results are highlighted in boldface.

#### **5. Co-Saliency**

The concept of co-saliency was first introduced by Toshev et al. [85] to address the problem of image matching, exploiting local point feature correspondence and region segmentation. By its original definition, therefore, co-saliency estimation refers to determining the common element from two or more instances of exactly the same subject. A more general interpretation would extend the concept to groups of images depicting different instances of the same category [86–88] (e.g., many images of different lions). Regardless of the specific definition, the presence of multiple images can provide a useful constraint in the otherwise ill-posed problem of saliency estimation, thanks to the assumption that all images (or a subgroup [89]) contain the same salient element.

Co-saliency estimation is often encountered along with other related tasks, namely co-segmentation [90] and co-localization [91]. While the output of a method for co-saliency is a continuous map, representing the probability of each pixel being salient, the output of a method for co-segmentation is typically a binary mask, that precisely separates the foreground from the background. Following a similar abstraction, co-localization refers to generating a bounding-box over common elements in multiple images.

#### *5.1. Methods for Co-Saliency*

Table 5 presents a selection of recent methods for co-saliency estimation that were well received by the scientific community. All presented methods target a binary salient object region ground truth. The output of these methods is a continuous-valued saliency map, which is, however, optimized to be as sharp as possible. In some cases [37,68,69,92], the methods also produce a segmentation-oriented binary mask.

**Table 5.** Characteristics of the analyzed methods for co-saliency estimation. The "Target" column indicates the nature of the ground truth used to train or develop the methods, while "Output" describes what data they are explicitly able to generate (OR = Object Regions, SM = Sharp saliency Maps, BM = Binary Masks).


The co-saliency domain involves, by definition, the analysis of multiple images. How these are handled can help in differentiating among different methods for co-saliency estimation. Early-fusion techniques [70,89] initially extract a global representation of all the images in the input group, capturing relationships between different images. Conversely, late-fusion techniques [67,92,94,95] are designed to estimate single-image saliency from each input individually, and reciprocally update them in a second phase, based on the extracted information.

Combining early and late fusion are methods that extract both so-called "intra-image saliency" (i.e., from each individual image) and "inter-image saliency" (i.e., the correspondence among multiple images), and eventually merge the two [37,68,69,96]. Cong et al. (2018) [96] proposed computing intra-saliency maps exploiting the depth information associated with each image, and calculating the inter-saliency maps based on multi-constraint feature matching to improve the overall performance. A cross-label-propagation scheme was adopted to optimize and refine both maps reciprocally, before integrating them into a final co-saliency map. In a subsequent work, Cong et al. (2019-II) [93] formulated the inter-image correspondence as a hierarchical sparsity reconstruction framework. They addressed image-pair correspondences through a set of pairwise dictionaries, and global image group characteristics through a ranking-scheme-based common dictionary. A three-term energy function refinement model was introduced in order to improve the intra-image smoothness and inter-image consistency. Wang et al. [37] extended the concept of co-saliency from images to videos, and as such operated on multiple input video sequences. They took into consideration both inter-video foreground correspondences and intra-video saliency stimuli, with the objective of ignoring background distraction elements and concurrently emphasizing salient foreground regions. Tsai et al. [68] observed that the auxiliary task of co-segmentation improves object boundaries in co-saliency detection, and proposed a joint optimization of the two tasks by solving an energy minimization problem over a graph. The resulting model iteratively transfers useful information in both directions, to improve the prediction in both domains. The solution by Jeong et al. [69] produces an initial set of co-saliency maps, which are then refined on object boundaries. The authors then introduced a seed-propagation step over an integrated multilayer graph, aimed at detecting regions missed by lower-level descriptors. Such descriptors are pooled both within-segment and within-group, in order to handle input images having different sizes.

Another possible criterion to discriminate among different approaches is the distinction between deep-learning solutions and those based on hand-crafted design and traditional techniques. Methods in the deep learning group [67,70,94,95] typically benefit from end-to-end learning, therefore optimizing the final objective of co-saliency estimation regardless of the adopted early-fusion or late-fusion approach. Many are based on the Fully-Convolutional Network (FCN) by Long et al. [97] or DeepLab by Chen et al. [18], both leveraging the VGG backbone [75]. Other adopted neural architectures include the "Slow" CNN-S model by Chatfield et al. [74]. Zhang et al. [67] presented a coarse-to-fine framework for co-saliency detection: they first generate an initial proposal using a mask-guided fully convolutional network, based on the high-level feature response maps of a pre-trained VGG network [75]. They then defined a multi-scale label smoothing model to refine the prediction, optimizing both the label smoothness of pixels and superpixels. Wei et al. [70] presented an end-to-end co-saliency estimation neural network. The model adopts an early-fusion approach by extracting high-level descriptions of the input images, and capturing the group-wise interaction information for group images. It was proven to be able to learn the collaborative relationships between single-image features and group-wise features. Hsu et al. [94] presented an original unsupervised approach to co-saliency estimation, addressed in a graphical model based on two losses: the single-image saliency (SIS) loss, acting as the unary term, and the Co-occurrence (COOC) loss, acting as the pairwise term. The authors also presented two refining extensions, namely boundary preservation and map sharpening. Zheng et al. [95] presented FASS: a feature-adaptive semi-supervised framework for co-saliency estimation. The proposed solution addresses and exploits the difference in efficacy of image features, by a joint formulation of element-level feature selection and view-level feature weighting. It optimizes co-saliency label propagation over both labeled and unlabeled image regions.

The purely hand-crafted methods include the aforementioned video co-saliency solution by Wang et al. [37], and the more recent work by Jerripothula et al. [92]. Specifically, the latter focuses on predicting the saliency map for one selected key image, and subsequently extending the prediction to other images in the group. The authors proposed fusing individual saliency maps using the "dense correspondence" technique, and evaluating a no-reference concentration-based saliency quality to decide whether the fused saliency map improves upon the original one.

Finally, bridging the gap between deep learning solutions and purely hand-crafted ones are those traditional methods that extract high-level deep features from a pre-trained model, and use them as a descriptor in combination with other pieces of information for co-saliency estimation [68,89,96]. A notable example is represented by Yao et al. [89], who generalized the problem of co-saliency estimation to the case where multiple object categories are present in the input image group. The task has therefore been decomposed into two sub-problems: automatically identifying subgroups of images, based on multi-view spectral rotation co-clustering, and subsequently extracting the co-saliency information from such groups.

#### *5.2. Datasets for Co-Saliency*

Table 6 presents the combination of methods and datasets used in the corresponding experiments for training and evaluation. The most frequently adopted datasets are iCoseg [98] and various versions of the MSRC from Microsoft Research [86]. The latter is particularly old: its first version dates back to 2005, and it was originally collected for a purpose other than saliency estimation. Different updates of the dataset have been released through the years, and the specific version is indicated in Table 6 by specifying the number of input image groups.

The number of image groups is also one of the discriminating elements reported in Table 7 along with other cardinality-related information. The stimuli are described in terms of data and content type. For most reported datasets, the resolution is extremely heterogeneous across images, and it is therefore reported as a minimum-maximum side pair. The "same subject" column indicates whether each image group depicts exactly the same instance of the subject from different points of view, or multiple instances of the same category. All the reported co-saliency datasets provide a binary salient object region annotation, i.e., none have been collected with the aid of eye tracking devices for scanpath acquisition, relying instead on manual annotation of the contours of salient objects.

RGBD Coseg183 [30] is a dataset developed for those co-saliency estimation methods that exploit the depth information associated with the input RGB image. It is partially composed of images from the RGBD Scenes Dataset [99], which were acquired using a prototype PrimeSense RGB-D camera and a firewire camera from Point Grey Research.

RGBD Cosal150 [96] is a selection of images and depth maps originally coming from the RGBD NJU-1985 dataset [100] (Nanjing University), which are augmented with co-saliency pixel-level annotations. The depth information in the original dataset comes from mixed sources: either from the Kinect device used for acquisition, or inferred through an optical-flow-based method [101]. This dataset has been presented in the previously discussed method by Cong et al. (2018).

iCoseg [98] was collected using the "Group" functionality in the Flickr photography platform, in order to collect groups of images belonging to the same category (and sometimes, the same photographer), which includes various wild animals, popular landmarks, and sports teams. The authors also made available for public download the developed interface that was used to interactively annotate the dataset.

MSRC [86] (from Microsoft Research) is the oldest dataset commonly used for training and evaluation of co-saliency algorithms, although originally collected for applications related to image classification. Multiple versions of the dataset exist, with the number of image groups ranging from 7 to 23. Table 7 reports information regarding the 14-groups version of the dataset.



**Table 7.** Selected datasets for co-saliency estimation with corresponding characteristics (OR = Object Regions).


Authors of the Cosal2015 [87] dataset gathered images in challenging scenarios from the YouTube video set [115] and the ILSVRC2014 detection set [84] (ImageNet Large Scale Visual Recognition Competition), observing that images belonging to the same group often involve similar backgrounds, leading to potentially wrong co-saliency estimations. The dataset has been annotated by 20 different users, whereas most of the other reported datasets involve one human annotation per image.

Coseg-Rep [88] is a dataset for co-segmentation and co-sketch, the objective being to automatically infer a common pattern from instances of the same subject. It is composed of 22 categories of different flowers and animals, plus a special "repetitive" category, which contains images with repeating patterns aimed at inter-image co-segmentation and co-saliency.

Internet Images [102], also known as Internet Datasets, is composed of only three image groups (car, horse, and airplane), characterized however by high cardinality inside each group. It presents a total of 15,270 images, out of which 2746 are provided with a segmentation ground truth that was acquired using both the LabelMe annotation toolbox [116] and Amazon Mechanical Turk.

The Safari dataset [104] is a video-based collection of annotated sequences for object co-segmentation, partially built upon the existing MOViCS dataset [117] (Multi-Object Video Co-Segmentation). It is composed of nine videos of five animal classes: for each class, there is one video sequence containing only that specific class, plus one or more videos of the class in conjunction with other classes.

Vicosegment [37] is another, more recent, video dataset for co-segmentation and co-saliency. It is composed of 10 category groups containing similar foreground objects, and a total of 38 videos with cardinality ranging between 18 frames and 40 frames. This dataset was presented in conjunction with the already presented method by Wang et al. based on inter-video foreground correspondence and intra-video saliency stimuli.

The Image-Pair [103] dataset contains groups of only two images, depicting (at least) one common object on two different background scenes. It is composed of image pairs collected from the dataset from Hochbaum et al. [118], the Caltech-256 Object Categories database [119], and the PASCAL Visual Object Challenge dataset [120].

#### *5.3. Evaluation of Co-Saliency*

Although not specifically designed for co-saliency estimation with image groups, the Average Precision score AP is often applied for evaluation in this specific domain ([67–69,89,94,95]). It is proportional to the area under the Precision/Recall curve, generated as defined in Section 3.3.
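As an illustration of how such a score can be obtained, the following minimal sketch thresholds a continuous saliency map at several levels, collects the resulting Precision/Recall pairs, and integrates them with the trapezoidal rule. The function names, the number of thresholds, the anchor point at zero recall, and the integration rule are assumptions made for this example; the exact procedure of Section 3.3 and of the individual publications may differ.

```python
import numpy as np

def precision_recall_curve(saliency, gt, num_thresholds=256):
    """Precision/Recall pairs from thresholding a saliency map with values in [0, 1]."""
    saliency = np.asarray(saliency, dtype=float)
    gt = np.asarray(gt, dtype=bool)
    n_pos = gt.sum()
    # Assumption: anchor the curve at (recall=0, precision=1).
    precision, recall = [1.0], [0.0]
    # Decreasing thresholds make the predicted region grow, so recall never decreases.
    for t in np.linspace(1.0, 0.0, num_thresholds):
        pred = saliency >= t
        n_pred = pred.sum()
        if n_pred == 0:
            continue  # precision is undefined when nothing is predicted salient
        tp = np.logical_and(pred, gt).sum()
        precision.append(tp / n_pred)
        recall.append(tp / n_pos if n_pos > 0 else 0.0)
    return np.array(precision), np.array(recall)

def average_precision(saliency, gt, num_thresholds=256):
    """Area under the Precision/Recall curve (trapezoidal rule)."""
    p, r = precision_recall_curve(saliency, gt, num_thresholds)
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))
```

For an image group, one plausible choice (again an assumption) is to average the per-image scores over all images in the group.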

Other measures commonly used for co-saliency evaluation are the *F<sup>β</sup>* ([37,67–70,89,94–96]) and the area under the ROC curve AUC ([67–70]).
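The F-measure combines precision and recall into a single value. The sketch below computes it for one binarized saliency map; the fixed threshold of 0.5 and the weighting β² = 0.3 (a value frequently found in the saliency literature) are assumptions of this example and may not match the settings used by each surveyed method.

```python
import numpy as np

def f_beta(saliency, gt, threshold=0.5, beta_sq=0.3):
    """F-measure of a thresholded saliency map against a binary ground truth.

    threshold and beta_sq are illustrative defaults (assumptions).
    """
    pred = np.asarray(saliency, dtype=float) >= threshold
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()
    if tp == 0:
        return 0.0  # no overlap between prediction and ground truth
    precision = tp / pred.sum()
    recall = tp / gt.sum()
    return float((1 + beta_sq) * precision * recall / (beta_sq * precision + recall))
```

With β² < 1 the measure weights precision more than recall, which penalizes maps that mark large portions of the image as salient.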

The dataset/method matrix for co-saliency estimation presented in Table 6 suggests using either iCoseg [98] or MSRC [86] as a comparison benchmark. We decided to focus on iCoseg, due to the extreme variability of MSRC versions adopted by different methods. Results are reported in Table 8: the overall best performance is reached by Zheng et al. [95], followed by Zhang et al. [67] and Hsu et al. [94] for *F<sup>β</sup>* and Average Precision (AP).



\* Values inferred from graphs in the corresponding publication.

All these solutions are VGG-16-based neural networks, adopting a late-fusion approach. This common pattern can be justified as semantic interpretation is particularly relevant in a domain that requires finding common elements across different images. At the same time, the recent inter-saliency/intra-saliency paradigm, although promising in the context of the corresponding publications, is possibly not yet mature enough. In this specific evaluation setup, in fact, the work by Yao et al. [89] presents the lowest performance. It should be noted, however, that the corresponding solution performs the selection of image groups in a completely unsupervised manner, while all other methods rely on existing annotated clusters.

#### **6. Video Saliency**

Saliency estimation in video sequences presents a specific set of advantages as well as original challenges. In the same spirit as co-saliency, the availability of multiple images (i.e., frames) imposes useful constraints on the ill-posed problem of saliency estimation. Unlike co-saliency datasets, video saliency ones are sometimes collected with the use of an eye tracking device, instead of manually segmenting the elements of interest in each frame. One effect of this approach is the high variability in the ground truth saliency maps across different frames: Li et al. [43] and Fan et al. [49] recently proposed to explicitly consider the phenomenon of saliency shift, where the viewer's attention can briefly change due to distracting elements, or even transfer indefinitely to a whole different salient object. Furthermore, as noted by Ullah et al. [121], saliency estimation in videos can prove to be particularly difficult when the salient object is in motion, is small, changes shape, and is embedded in a context where the camera itself is moving.

#### *6.1. Methods for Video Saliency*

Table 9 enumerates recent solutions for saliency estimation in video sequences, along with additional pieces of information. Particular attention should be paid to differentiating methods that target salient object region annotations from those that target fixation-related saliency maps [65,66]. Specifically for the former category, some of the described solutions are tested against datasets that were originally annotated for video segmentation [122–124], and in some cases the method itself is described as addressing "saliency-based video segmentation" [121,125,126], showing once again the correlation between such tasks.



An inherent characteristic of video-based processing is the temporal window, i.e., the number of frames that are jointly analyzed in order to exploit the time dimension. Methods indicated with ∞ are not constrained by an explicit limit on the temporal window, although the influence of other frames on the current one typically decreases with their distance. Other criteria useful in discriminating among different methods include the type of representation involved (such as optical flow), and the type of model involved. In this case, for deep learning methods, the backbone CNN is also reported.

In computer vision, optical flow can be defined as a displacement vector field that describes, for each pixel in each frame, the direction and intensity of the movement from the previous frame (or frames). Solutions for video saliency estimation based on optical flow [18,21,66,121,125,127,128,130] demonstrate that explicitly and compactly representing the time-wise variations provides a valuable piece of information for accurate detection of salient objects in video sequences. Cong et al. (2019-III) [129] designed a single-frame saliency model based on sparsity-based reconstruction, and an inter-frame saliency map based on progressive sparsity-based propagation. The two maps are then incorporated in a global consistency energy formulation to achieve spatio-temporal smoothness. Hu et al. [125] framed the problem at hand as an "unsupervised video segmentation" task. They exploited edge-aware features and the optical flow representation to develop a novel diffusion technique based on a neighborhood graph. With this approach, they were able to eventually produce a generic object segmentation based on the propagation of estimated saliency information. Zhou et al. [130] developed a three-step framework. A set of localized estimation models, generated through a random forest regressor, is first used to create a temporary saliency map. This is then improved through a spatio-temporal refinement step, based on appearance and motion information. The resulting map is finally used to provide saliency cues for the following frame estimation. Ullah et al. [121] presented an approach for so-called "unconstrained video segmentation". They first generate an initial set of saliency regions through a novel saliency measure. They then compute a homography over optical flow information to retrieve motion cues that are robust to background motion. The two pieces of information are eventually combined, expanded and refined. Wang et al. [126] developed a super-pixel-based technique that initially produces a prior map for pixel-wise labeling, exploiting a geodesic distance. They then formulated the task as an energy minimization problem, operating on foreground-background models and dynamic location models as unary terms, as well as label smoothness potentials as pairwise terms. Chen et al. [131] designed a method for video saliency detection based on spatio-temporal fusion and low-rank coherency guided diffusion. They first segment the input video into batches, where motion clues are internally diffused. Inter-batch saliency priors are then taken into account for a low-level saliency fusion. These clues are eventually used to guide a saliency diffusion step.
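To make the notion of optical flow as a per-pixel displacement field more concrete, the sketch below converts a precomputed flow field into magnitude and direction, and derives a crude motion cue by subtracting the median displacement as a proxy for camera motion. This is not the procedure of any surveyed method: the median subtraction is a deliberately simple stand-in for more robust compensation strategies (such as the homography-based one mentioned above), and the function names and the `(H, W, 2)` array layout are assumptions of this example.

```python
import numpy as np

def flow_magnitude_and_direction(flow):
    """Split an (H, W, 2) flow field of (dx, dy) displacements into polar form."""
    flow = np.asarray(flow, dtype=float)
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(dx, dy)    # intensity of the movement, in pixels
    direction = np.arctan2(dy, dx)  # direction of the movement, in radians
    return magnitude, direction

def motion_saliency_cue(flow):
    """Hypothetical cue: flow magnitude after removing the dominant (median) motion."""
    flow = np.asarray(flow, dtype=float)
    # Assumption: the median displacement approximates global camera motion.
    background = np.median(flow.reshape(-1, 2), axis=0)
    residual, _ = flow_magnitude_and_direction(flow - background)
    peak = residual.max()
    return residual / peak if peak > 0 else residual
```

In this toy formulation, a uniform camera pan yields a zero cue everywhere, while an object moving differently from the background stands out.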

Similarly to what has been observed with co-saliency and saliency in omnidirectional images, recent methods in the domain of video saliency are evenly split between hand-crafted solutions [121,125,126,130,131] and those based on a deep-learning approach [49,65,66,127,128]. Belonging to the latter category, Fan et al. [49] collected and annotated the DAVSOD dataset (Densely Annotated Video Salient Object Detection), and proposed a neural-network-based approach to video saliency detection that explicitly addresses the problem of "saliency shift" (the phenomenon where human attention switches from one element to another during the stimulus). Their solution is based on convolutional LSTM (Long Short-Term Memory) modules. Li et al. [127] designed a multi-task neural network for salient object detection in video sequences. The first task, accomplished by the first sub-network, consists of still-image saliency estimation. The second task aims at motion saliency detection based on optical flow images. The two sub-networks were trained end-to-end with the integration of specifically-designed motion-guided attention modules. Yan et al. [128] proposed a solution for video saliency estimation that does not rely on densely-annotated video sequences. They first developed a technique to generate pixel-level pseudo-ground truths from sparsely annotated video frames, based on a neural network operating on optical flow images. They then trained a neural model composed of a spatial refinement network and a spatio-temporal module on their artificially-augmented training data.

As mentioned, some solutions target a different representation of video saliency information, namely fixation-related saliency maps. Gorji et al. [66] focused on the concept of attentional push: a family of saliency cues that include following the gaze of depicted subjects, accounting for the salient element leaving the scene, and for abrupt scene changes in general. They exploited these concepts to augment a static saliency estimation with the objective of minimizing the relative entropy between estimated and expected fixation patterns. Min et al. [65] presented TASED-Net: a Temporally-Aggregating Spatial Encoder-Decoder neural architecture based on the S3D [132] model (and, consequently, on the Inception model [133]), that produces an estimation of saliency for a single frame based on a finite number of previous frames. In order to produce a continuous saliency estimation, the developed network can be applied in a temporal-sliding-window fashion over the whole input sequence.
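The temporal-sliding-window idea can be sketched independently of any specific network: a predictor that maps a fixed-length clip to the saliency map of its last frame is applied at every time step. In the example below, `predict_last_frame` is a hypothetical callable standing in for a trained model, `toy_predictor` is a placeholder used only to make the code runnable, and padding the beginning of the sequence by repeating the first frame is an assumption of this sketch, not necessarily the strategy adopted by the original authors.

```python
import numpy as np

def sliding_window_saliency(frames, predict_last_frame, window=4):
    """Produce one saliency map per frame from a clip-level predictor.

    frames: array of shape (T, H, W) or (T, H, W, C).
    predict_last_frame: hypothetical callable mapping a clip of `window`
        frames to the saliency map of the clip's last frame.
    """
    frames = np.asarray(frames)
    outputs = []
    for t in range(len(frames)):
        start = t - window + 1
        if start >= 0:
            clip = frames[start:t + 1]
        else:
            # Assumption: pad the missing history by repeating the first frame.
            pad = np.repeat(frames[:1], -start, axis=0)
            clip = np.concatenate([pad, frames[:t + 1]], axis=0)
        outputs.append(predict_last_frame(clip))
    return np.stack(outputs)

def toy_predictor(clip):
    """Placeholder model: temporal mean of the clip (illustration only)."""
    return clip.mean(axis=0)
```

Each output depends only on the current frame and a finite number of preceding frames, so the procedure can also run in a streaming fashion.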

#### *6.2. Datasets for Video Saliency*

Table 10 illustrates the datasets that were involved in the experiments of each analyzed method for video saliency estimation, with the objective of highlighting the relevant benchmarks for recent developments in this field. We separate the datasets related to methods that target different types of ground truth data, highlighting how UCFSports [134] is in fact used by solutions belonging to both worlds. Regarding methods aimed at salient object regions, it can be observed that the most frequently adopted datasets are FBMS [122] (Freiburg–Berkeley Motion Segmentation) and SegTrackV2 [123]. Despite not being very recent (both were released in 2013), they are described in depth in the following, due to their high relevance. Conversely, datasets that are particularly old, and which have been tested against by only one or a few methods, are not analyzed further.

Table 11 therefore presents detailed information for the selected datasets, reporting information on both the stimuli and the user responses. As indicated, some saliency datasets that are specific for the video domain are exclusively annotated with salient object regions [43,122–124]. Others are collected with an eye tracking device, thus producing saliency maps based on user fixations [134,135]. Finally, the very recent DAVSOD [49] provides both types of annotation, thus highlighting the existing relationship between these different representations.

DAVSOD [49] (Densely Annotated Video Salient Object Detection) is built upon the stimuli from the DHF1K [135] (Dynamic Human Fixation 1000) eye tracking dataset, manually trimmed into short video clips. The scenes are enriched with additional annotations, which include: timestamp of the shift in visual attention, category labeling into 7 classes and 70 sub-classes, and conversion of the fixation records into hand-drawn object segmentation masks, performed per-frame by multiple annotators.

FBMS [122] (Freiburg–Berkeley Motion Segmentation) is a dataset composed from existing sources (Brox et al. [136] and the Hopkins 155 [137]) as well as new sequences, for a total of 59 video clips. The videos have been specifically collected aiming at high variation in image resolution and motion types, and have been manually annotated every 20th frame, thus providing a sparse ground truth.

SegTrack [138] and SegTrackV2 [123] are among the most tested-against datasets for video saliency estimation, despite being originally intended for video segmentation. Both versions were collected with particular attention to equally representing challenging aspects, namely: color overlap between foreground and background, inter-frame motion, and changing target shape. The second version of the dataset introduces additional sequences and annotations.

VOS [43] (Video Object Segmentation) is composed of videos collected from internet sources as well as personal collections, divided into an easy and a difficult subset. One keyframe every 15 frames has been segmented at the object-level by a pool of four subjects. A different set of subjects have been asked to free-view the videos, in order to collect their eye tracking data, which are eventually used to unambiguously define and annotate the salient objects.







DAVIS [124] (Densely Annotated VIdeo Segmentation) comprises high-resolution short sequences that are manually annotated for pixel-accurate segmentation. Each clip depicts up to two spatially-connected objects, aiming at constraining the problem to a controlled and limited domain. Finally, all sequences are labeled with multiple attributes covering challenging aspects such as clutter, blur, appearance change, and many others.

UCFSports [134] was built upon the pre-existing large scale video dataset of the same name by Rodriguez et al. [145] from the University of Central Florida, originally published for human action recognition. This collection is composed of high-resolution recordings from television shows, covering nine sport action classes. Nineteen human subjects were divided into three groups and tasked with different objectives, namely: action recognition, context recognition, and free-viewing. The same procedure has been applied to build the Hollywood-2 saliency dataset, on top of the existing data from the dataset by Marszalek et al. [146].

DHF1K [135] (Dynamic Human Fixation 1000) has been collected with YouTube videos retrieved through 200 key search terms, following the principles of large scale and high quality, diverse content, varied motion patterns, and various objects. Seventeen subjects were tasked with free-viewing 10 sessions of non-overlapping videos presented in random order. Furthermore, five subjects were asked to provide an additional piece of annotation regarding the number of objects in each sequence.

#### *6.3. Evaluation of Video Saliency*

The analyzed methods for video saliency estimation introduce two domain-specific evaluation measures, namely the Temporal stability (T) [125], and the Per-frame pixel error rate (*ε*) [121]. Both are based on a salient object region ground truth. Temporal stability T is computed as the distance between the segmentation boundaries of two successive frames, expressed in terms of shape context descriptors [155]. Per-frame pixel error rate *ε* is computed as:

$$\varepsilon = \frac{\text{XOR}(th(P), G)}{N} \tag{16}$$

where *th*(*P*) is a binary (thresholded) version of the predicted saliency map, *G* is the reference ground truth, XOR(·,·) counts the pixels on which the two binary maps disagree, and *N* is the total number of frames in the input sequence.
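Equation (16) can be sketched in a few lines, assuming that predictions and ground truth are stacked as arrays of shape `(N, H, W)` and that the mismatched pixels are accumulated over the whole sequence. The function name and the binarization threshold of 0.5 are assumptions of this example.

```python
import numpy as np

def per_frame_pixel_error(pred_maps, gt_masks, threshold=0.5):
    """Average number of mislabeled pixels per frame, following Equation (16).

    pred_maps: (N, H, W) saliency maps with values in [0, 1].
    gt_masks:  (N, H, W) binary salient object regions.
    threshold: binarization level for th(P) (an assumption of this sketch).
    """
    pred = np.asarray(pred_maps, dtype=float) >= threshold  # th(P)
    gt = np.asarray(gt_masks, dtype=bool)                   # G
    if pred.shape != gt.shape:
        raise ValueError("predictions and ground truth must have the same shape")
    n_frames = pred.shape[0]                                # N
    mismatched = np.logical_xor(pred, gt).sum()             # XOR(th(P), G)
    return float(mismatched) / n_frames
```

Since the measure is expressed in pixels, values are only comparable across sequences of the same resolution.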

Other general-purpose measures often used to evaluate saliency estimation in the video domain include *F<sup>β</sup>* ([49,126–128,130,131]), the Precision/Recall curve ([126–128,130,131]), and MAE ([49,126,127,130]).
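For reference, MAE can be computed directly on the continuous prediction, without any thresholding. The following is a minimal sketch, assuming both maps are normalized to the [0, 1] range; the function name is chosen for this example.

```python
import numpy as np

def mean_absolute_error(saliency, gt):
    """Mean absolute difference between a saliency map and its ground truth.

    Both inputs are assumed to share the same shape and the [0, 1] range.
    """
    saliency = np.asarray(saliency, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(saliency - gt)))
```

Applied to a stack of frames, the same function returns the error averaged over all pixels of the sequence.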

The landscape defined by the dataset/method matrix for video saliency estimation in Table 10 is particularly scattered. We report in Table 12 quantitative results for the frequently adopted SegTrack v2 dataset, and for the DAVIS dataset. These two datasets are comparable in terms of video length and type of annotations, with DAVIS being composed of about three times as many sequences, at a higher resolution.

**Table 12.** Quantitative comparison of selected methods for video saliency estimation on the SegTrack v2 and DAVIS datasets. Best results are highlighted in boldface.


\* Values inferred from graphs in the corresponding publication.

We did not include an analysis of FBMS due to the wider variability of versions (subsets of video sequences) used by different methods. Drawing any conclusions in the reported scenario is particularly challenging: on the SegTrack v2 dataset, the hand-crafted method by Wang et al. [126] appears to be the most well-balanced solution according to *F<sup>β</sup>* and MAE, while on DAVIS the best results are obtained by Li et al. [127], which is a deep-learning model. At the same time, the best-*F<sup>β</sup>* method on SegTrack, developed by Zhou et al. [130], reports worse performance on other metrics and datasets. Fan et al. [49], which is based on the recently-introduced concept of saliency shift, reaches the best result in terms of MAE, at the cost of penalizing *F<sup>β</sup>*-based evaluation. It is therefore ultimately not clear whether one type of solution should be preferred over another for saliency estimation in video sequences.

#### **7. Conclusions**

We presented a survey on visual saliency estimation, by focusing on recent developments in domains that are not restricted to the traditional single-image input. Adequately modeling the process of visual saliency has been shown to be particularly useful and/or effective in specific cases, such as omnidirectional images employed in virtual reality scenarios, image groups depicting the same subject for co-saliency estimation, and finally video sequences for video saliency estimation.

Omnidirectional images, in particular, are the most recently-introduced domain for saliency. Many different methods in the analyzed literature approached the problem by developing novel representations of the input data, in a form that does not introduce, or that prevents, image distortions which might negatively impact the saliency estimation process. An evaluation of methods that are directly comparable showed that hand-crafted solutions present excellent results in this particular domain. Co-saliency estimation exploits the concept of image groups to partially constrain the ambiguous concept of visual saliency. Recent methods in this domain are focusing on the independent estimation of intra-image saliency (the traditional concept of image saliency) and inter-image saliency (finding common elements among images in the same group), and their subsequent combination. Direct comparison showed the apparent superiority of deep learning solutions for this specific domain. Video sequences offer yet another example of leveraging multiple pieces of input data to facilitate the saliency estimation process. The nature of ground truth data is inherently different from that of the traditional domain, as the viewer's attention can move to different elements in the short or long term. This phenomenon is called "saliency shift", and it has been explicitly addressed by recent methods in the field.

The ground truth information for visual saliency can be collected in different forms and levels of abstraction: scanpaths (directly related to eye-gaze trajectories), continuous saliency maps, and binary salient object regions. The datasets involved in recent methods for saliency estimation have been described, among other criteria, in terms of their ground truth nature. Datasets composed of omnidirectional images are provided with either scanpaths or saliency maps, i.e., no binary segmentation masks are provided. Conversely, all analyzed datasets for co-saliency are manually annotated in terms of binary salient objects, without the use of eye tracking devices. Finally, the domain of video saliency offers a heterogeneous scenario, with many datasets offering ground truth data at all levels of abstraction.

As a general observation that covers all analyzed domains, it is worth noting that a well-balanced distribution persists, between traditional hand-crafted algorithms and deep learning methods, among recent solutions for the problem of visual saliency estimation.

In conclusion, this work complements existing state-of-the-art analyses that mainly focus on regular images. We integrated such studies with a review on saliency estimation for omnidirectional images, image groups, and video sequences. A natural extension of this work is to develop a thorough analysis of emerging topics such as light field saliency and hyper-spectral saliency, as well as widely-explored domains such as depth-assisted visual saliency estimation.

**Funding:** The research leading to these results has received funding from TEINVEIN: TEcnologie INnovative per i VEicoli Intelligenti, CUP (Codice Unico Progetto - Unique Project Code): E96D17000110009 - Call "Accordi per la Ricerca e l'Innovazione", cofunded by POR FESR 2014-2020 (Programma Operativo Regionale, Fondo Europeo di Sviluppo Regionale—Regional Operational Programme, European Regional Development Fund).

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

