**2. Proposed Method**

Each digital image from the dataset is converted into an RGB color space matrix representation. The segmentation process can be summarized in three phases: dimensionality reduction through a clustering approach, grouping formulated as graph partitioning, and a final ROI extraction. The preliminary clustering step speeds up N-Cuts, consistent with the computational complexity analysis, in terms of both space and time, given in the original N-Cuts paper. The algorithm constructing the region adjacency graph (RAG) no longer considers each pixel of the original resolution, but groups of pixels preserving the spatial and color differences amongst them. Finally, we aim at capturing a non-linear relation between the brightness intensities of the red and green channels, based on the previously discussed reflectance assumptions from a spectrophotometric point of view.

### *2.1. K-Means Dimensionality Reduction*

The objective of a clustering task is to group data instances into subsets maximizing a similarity measure, while dissimilar instances should belong to different groups [39–41]. We applied the principles of k-means clustering to the image segmentation task. The main goal in this phase is to partition the feature space, similarly to a Voronoi diagram of the plane, reducing the complexity of the graph representing the original image. Each pixel will from now on be referred to as a vector in a five-dimensional space: the x and y coordinates from the matrix, and the R, G and B channel intensities from the color representation.

$$f(x,y) = \overrightarrow{p} = \alpha\_x \overrightarrow{p\_x} + \alpha\_y \overrightarrow{p\_y} + \alpha\_r \overrightarrow{p\_r} + \alpha\_g \overrightarrow{p\_g} + \alpha\_b \overrightarrow{p\_b} \tag{2}$$

This approach allows us to iteratively minimize the sum of distances from each pixel to its cluster centroid. The algorithm alternates between two steps: assigning each pixel to its nearest centroid according to the Euclidean distance of Equation (3), and recomputing each centroid as the mean of its assigned pixels, as in Equation (4):


$$d(\overrightarrow{u}, \overrightarrow{v}) = \|\overrightarrow{u} - \overrightarrow{v}\| = \sqrt{\left(u\_x - v\_x\right)^{2} + \left(u\_y - v\_y\right)^{2} + \left(u\_r - v\_r\right)^{2} + \left(u\_g - v\_g\right)^{2} + \left(u\_b - v\_b\right)^{2}} \tag{3}$$


$$c\_k = \frac{1}{n} \sum\_{i=1}^{n} \overrightarrow{p\_{ki}} \tag{4}$$

This approach, belonging to the broader field of unsupervised learning, consists of an initial batch phase, in which at each step every point is reassigned to its nearest cluster centroid and the centroids are then recomputed, followed by online updates, in which a point is reassigned only if the reassignment reduces the sum of intra-cluster distances. These updates converge quickly towards a local minimum.
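A minimal sketch of this clustering step is given below; it is an illustrative implementation under assumed parameters, not the authors' reference code. Each pixel is mapped to the (x, y, R, G, B) vector of Equation (2) (with the weighting coefficients *α* set to 1), and scikit-learn's `KMeans` performs the nearest-centroid assignment of Equation (3) and the centroid update of Equation (4) internally; the number of clusters is an illustrative value.

```python
# Illustrative sketch of the five-dimensional k-means step (assumed parameters).
import numpy as np
from skimage import io
from sklearn.cluster import KMeans

def kmeans_superpixels(image_path, n_clusters=50, random_state=0):
    img = io.imread(image_path)                          # H x W x 3 RGB image
    h, w, _ = img.shape
    ys, xs = np.mgrid[0:h, 0:w]                          # pixel grid coordinates
    # One (x, y, R, G, B) feature vector per pixel, as in Equation (2).
    features = np.column_stack([
        xs.ravel(), ys.ravel(),
        img[..., 0].ravel(), img[..., 1].ravel(), img[..., 2].ravel(),
    ]).astype(float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    # Lloyd iterations: nearest-centroid assignment (Equation (3)) and
    # centroid recomputation (Equation (4)).
    labels = km.fit_predict(features)
    return labels.reshape(h, w), km.cluster_centers_     # label map, 5-D centroids
```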

In Figure 3, the original image is processed first with a three-dimensional (R, G and B) color space and then, in the last picture, with the five-dimensional model including both color and spatial features. The latter does not increase computational complexity, since the only calculation affected is the distance function. However, in each digital image analyzed, the intra-cluster variance is minimized efficiently, with properly outlined boundaries between the groups of pixels. The classified instances close to the mucocutaneous junction are noisy in the first approach, while in the second one each semantic class (iris, pupil, sclera, eyelid and conjunctiva) appears as a compact union of clusters.

**Figure 3.** (**a**) Original digital image acquired; (**b**) k-means clustering procedure using only the three-dimensional (R, G and B) color space; (**c**) proposed k-means procedure with a five-dimensional model retaining both spatial and color properties.

### *2.2. Normalized Cuts Segmentation*

K-means as a clustering algorithm is a valuable approach for exploiting local impressions of a scene, but it fails to provide a global or hierarchical perspective. For this reason, we take advantage of a grouping algorithm that treats segmentation as a graph partitioning problem, namely N-Cuts, which generalizes better when applied to different scenarios. Conventionally, the normalized cut is an unbiased measure of dissimilarity between graph subgroups [42]. We convert the set of superpixels from the five-dimensional feature space into a weighted undirected graph *G* = (*V*, *E*). Each superpixel is included in the set of nodes, with one edge for each pair of vertices.

The region adjacency graph is constructed from the areas precomputed by the k-means segmentation algorithm. Each connection amongst them is depicted in Figure 4b and can be represented in a weight matrix W. The edge weight *wij* between node *i* and node *j* is defined, as in the standard normalized cuts formulation, as the product of a feature similarity term and a spatial term. *X*(*i*) is the coordinate vector of the region centroid and *F*(*i*) is a feature vector based on the averaged R, G and B intensities of the pixels in the area. The value *r* acts as a proximity threshold on the Euclidean distances amongst the precomputed centroids. In our specific application we tried configurations ranging from 3 to 100, which regulate the sparsity of the weight matrix without impacting the segmentation outcome. Weights and features are described by the following equations:

$$w\_{ij} = e^{-\frac{\|F(i) - F(j)\|\_2^2}{\sigma\_I}} \cdot \begin{cases} e^{-\frac{\|X(i) - X(j)\|\_2^2}{\sigma\_X}}, & \text{if } \|X(i) - X(j)\|\_2 < r\\ 0, & \text{otherwise} \end{cases} \tag{5}$$

$$F(i) = \begin{bmatrix} \frac{1}{n} \sum\_{j=1}^{n} p\_{jr} & \frac{1}{n} \sum\_{j=1}^{n} p\_{jg} & \frac{1}{n} \sum\_{j=1}^{n} p\_{jb} \end{bmatrix} \tag{6}$$

The algorithm is capable of extracting significant components from each sample of the dataset while suppressing intra-cluster variations.
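To make Equations (5) and (6) concrete, the sketch below builds the weight matrix of the region adjacency graph from a k-means label map. It is a simplified illustration under assumed parameter values: *σI*, *σX* and the radius *r* are placeholders (the paper only reports *r* in the range 3 to 100), and colors are assumed to be normalized to [0, 1].

```python
# Illustrative RAG weight matrix following Equations (5) and (6); parameters are assumed.
import numpy as np

def rag_weights(label_map, img, sigma_I=0.1, sigma_X=4.0, r=10.0):
    img = img.astype(float) / 255.0          # assume 8-bit RGB, colors scaled to [0, 1]
    labels = np.unique(label_map)
    n = len(labels)
    X = np.zeros((n, 2))                     # region centroids (spatial term)
    F = np.zeros((n, 3))                     # mean R, G, B per region (Equation (6))
    for k, lab in enumerate(labels):
        ys, xs = np.nonzero(label_map == lab)
        X[k] = [xs.mean(), ys.mean()]
        F[k] = img[ys, xs].mean(axis=0)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dx = np.linalg.norm(X[i] - X[j])
            if dx < r:                        # spatial proximity gate of Equation (5)
                w = (np.exp(-np.linalg.norm(F[i] - F[j]) ** 2 / sigma_I)
                     * np.exp(-dx ** 2 / sigma_X))
                W[i, j] = W[j, i] = w
    return W
```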

**Figure 4.** (**a**) Acquired sample; (**b**) region adjacency graph (RAG) displaying a measure of similarity between each region. The center of each node is considered a vertex. For each connection between two regions, there is an associated colored line according to the measure of similarity.

In Figure 5, we added a visual semantic description of the resulting cuts. With this phase, we raise the level of abstraction of the segmentation, starting from the clusters of Figure 3; we end up with features closer to an anatomical perspective. The small gap in colors between the conjunctival area and mucocutaneous junction is perfectly delineated in each sample from the dataset, paving the way for a machine-learning-based anemia estimator.

In the proposed segmentation output from Figure 5, a recursive approach could be run to further decompose regions of interest within the conjunctival area. As an example, this could lead to a better separation of the two conjunctivae, palpebral and forniceal, and thus contribute to the open debate about the prevalence of one or the other as the best estimator of anemia [43]. In fact, the palpebral conjunctiva highlights the vascularization of the underlying area better than the forniceal one and probably allows minimal variations of blood color to be detected. This assumption seems confirmed by the scientific literature. However, some authors take into consideration the whole conjunctiva, including both palpebral and forniceal, to construct and validate their models; it is still an open problem. Furthermore, in [43] the authors state that it would be interesting to establish whether investigations carried out on a small portion of the conjunctiva can be sufficient and position independent. In fact, the sparsity and density of the blood micro-vessels can change in different parts of the eyelid. Therefore, the recursive identification of further clusters can help to answer the above questions.

**Figure 5.** Segmentation output result with semantic class description of eye anatomy.

### *2.3. Hemoglobin Heatmap Coefficients*

In medical image and radar signal processing tasks, contrast enhancement is a widely used technique in various applications, ranging from improving the quality of photographs acquired in poor conditions [44] to emphasizing regions of interest [45,46]. Histogram equalization is one of the most common approaches due to its simple mechanism and effectiveness, but as a drawback, image brightness usually changes after the procedure because of its flattening behavior. In our study, the objective is to approximate, for digital images, the multi-layered spectrophotometric reflectance model investigated in Section 1.2. Several studies in the literature apply spectral domain scanning, which results in a time-consuming acquisition process and expensive equipment; this approach does not fit our need for a cheap, non-invasive diagnostic tool. An example of the ill-posed problem known as spectral reconstruction from an RGB scene has been addressed with deep learning techniques in [47,48]. Recent research strongly supports the validity of these approaches; nevertheless, our application domain allows a considerably simpler solution. Our method, interpreting the image as a signal, performs a pixel-wise non-linear transformation of the red and green color space values, returning a coefficient that highlights vascularized regions. In the literature, the ratio between the R and G channels has often been used as a guide to spot those areas, with the highest values found in forniceal and palpebral conjunctival tissues. We propose a filtering technique based on a generalized logistic function, which offers more flexibility than a standard sigmoid. Considering an image I as a vector of three channel functions over grid coordinates, we obtain the following *σ* transformation:

$$I(x, y) = \begin{bmatrix} r(x, y) \\ g(x, y) \\ b(x, y) \end{bmatrix}, \qquad \qquad \sigma(I, x, y) = \frac{1}{1 + e^{-\alpha\left(\frac{I\_r(x, y)}{I\_g(x, y)} - \beta\right)}} \tag{7}$$

The parameter *α* determines the slope of the function, emphasizing the discrepancy in terms of ratio between color channels; *β* acts as a minimum ratio threshold for the activation of each pixel.
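A minimal sketch of the transformation in Equation (7) is shown below; the small constant added to the green channel to avoid division by zero is our own guard, not part of the published formula, and the default values *α* = 4 and *β* = 2 follow the parameterization discussed next.

```python
# Pixel-wise generalized logistic transform of the R/G ratio (Equation (7)).
import numpy as np

def hemoglobin_heatmap(img, alpha=4.0, beta=2.0, eps=1e-6):
    img = img.astype(float)
    ratio = img[..., 0] / (img[..., 1] + eps)              # I_r / I_g per pixel
    return 1.0 / (1.0 + np.exp(-alpha * (ratio - beta)))   # scores in (0, 1)
```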

A comparison of the behavior of the standard and generalized logistic functions with parameterization *α* = 4 and *β* = 2 is depicted in Figure 6. This parameterization yielded results with a remarkable capability of generalizing well across diagnostic imaging, ranging from conjunctival tissue to endoscopic domains. Increasing values of *α*, which control the steepness, tend towards the trivial case of a binarizing step function, losing information about the relationship underlying the variety of brightness ratios. An application of this model to digital images of the conjunctival region is illustrated in Figure 7.

**Figure 6.** (**a**) Standard logistic function plot. (**b**) Generalized logistic function plot using parameters *α* = 4 and *β* = 2.

**Figure 7.** (**a**) Acquired sample. (**b**) Heatmap plot of the scoring matrix displaying the magnitudes of the coefficients computed by applying the generalized sigmoid function on the acquired sample.

The real values range from 0 to 1, according to the *σ* function definition. The filtering process produces a scoring matrix that assigns lower values to the background, including the sclera, pupil, iris, eyelid and the white support platform of the device. The palpebral and forniceal conjunctiva are primarily perfused by both the internal and external carotid arteries; this is reflected in high values of the scoring matrix, ranging from 0.7 to 1, and the respective blood vessels are significantly highlighted, as shown in Figure 7b.

Since we are interested in obtaining a semantic interpretation of the regions proposed by NCut, the matrix of coefficients acts as an effective bias when calculating the probability distribution of each class. The edge weights between aggregated pixel regions are strengthened or weakened according to the *σ* coefficients, resulting in a region proposal based on the magnitude of the connection. In Figure 8, we provide a subset of 10 digital images from the dataset, showing the qualitative difference between the proposed semantic segmentation (top row) and the manually segmented ground truth (second row). In Figure 9 we provide two samples of erroneous acquisitions in order to show the robustness of the proposed segmentation in unusual conditions; in fact, only images with excellent characteristics can provide useful information for the correct estimation of anemia.
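One plausible way to apply this bias, sketched below, is to assign each region its mean *σ* score and rescale every edge weight by the product of the two endpoint scores, so that connections between strongly vascularized regions are reinforced. The multiplicative form is our own assumption; the paper states only that the edge weights are strengthened or weakened by the *σ* coefficients.

```python
# Assumed multiplicative biasing of RAG edge weights by the sigma scoring matrix.
import numpy as np

def bias_weights(W, label_map, sigma_map):
    labels = np.unique(label_map)
    # Mean sigma coefficient per region.
    s = np.array([sigma_map[label_map == lab].mean() for lab in labels])
    # Edges between two highly scored regions are strengthened, others weakened.
    return W * np.outer(s, s)
```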

**Figure 8.** The top row represents a subset of samples automatically segmented with the proposed approach. The ordered second row depicts the mapping with the manual segmentation ground truth of the conjunctival region.

**Figure 9.** Examples of two images that would normally be discarded: the first because the eyelid overlaps the edge of the white spacer and is not perfectly in focus; the second because it is also out of focus and a finger appears to lower the eyelid. In both cases, automatic segmentation still provides an acceptable result.
