Article

FA-Seed: Flexible and Active Learning-Based Seed Selection

by Dinh Minh Vu 1,* and Thanh Son Nguyen 2,*
1 Faculty of Software Engineering, School of Information and Communication Technology, Hanoi University of Industry, No. 298 Cau Dien Street, Tay Tuu Ward, Hanoi 100000, Vietnam
2 Faculty of Economic Information Systems, Academy of Finance, No. 58, Le Van Hien Street, Dong Ngac Ward, Hanoi 100000, Vietnam
* Authors to whom correspondence should be addressed.
Information 2025, 16(10), 884; https://doi.org/10.3390/info16100884
Submission received: 2 September 2025 / Revised: 28 September 2025 / Accepted: 4 October 2025 / Published: 10 October 2025

Abstract

This paper addresses the fundamental problem of seed selection in semi-supervised clustering, where the quality of initial seeds has a significant impact on clustering performance and stability. Existing methods often rely on randomly or heuristically selected seeds, which can propagate errors and increase dependence on expert labeling. To overcome these limitations, we propose FA-Seed, a flexible and adaptive model that integrates active querying with self-guided adaptation within the framework of fuzzy hyperboxes. FA-Seed partitions the data into hyperboxes, evaluates seed reliability through measures of membership and association density, and propagates labels with an emphasis on label purity. The model demonstrates strong adaptability to complex and ambiguous data distributions in which cluster boundaries are vague or overlapping. The main contributions of FA-Seed include: (1) automatic estimation and selection of candidate seeds that provide auxiliary supervision, (2) dynamic cluster expansion without retraining, (3) automatic detection and identification of structurally complex regions based on cluster characteristics, and (4) the ability to capture intrinsic cluster structures even when clusters vary in density and shape. Empirical evaluations on benchmark datasets, specifically the UCI and Computer Science collections, show that our approach consistently outperforms several state-of-the-art semi-supervised clustering methods.

1. Introduction

Semi-supervised clustering (SSC) has emerged as an important paradigm in modern machine learning, as it leverages labeled seeds to improve clustering quality [1]. SSC enables learning from both limited labeled data and abundant unlabeled data, and is particularly useful in scenarios involving noisy or complexly distributed data. Pedrycz [2] demonstrated that, for practical applications which often require multiple intermediate ways of discovering structure in a dataset, performance can be significantly improved by using prior information; even a small proportion of labeled samples can markedly enhance clustering results.
Semi-supervised fuzzy clustering (SSFC) extends fuzzy clustering by incorporating semi-supervised learning. SSFC integrates the flexibility of fuzzy logic with the guidance of partial supervision [3,4,5]. Unlike hard clustering, which assigns each data point to exactly one cluster, fuzzy clustering allows each point to belong to multiple clusters with varying degrees of membership. This feature is especially suitable for problems with unclear cluster boundaries or noisy data.
Within the SSFC framework, the selection of high-quality seed instances plays a decisive role in clustering performance [6]. Well-chosen seeds can reduce labeling costs, enhance cluster separability, and stabilize convergence, whereas poorly chosen seeds may propagate errors, degrade clustering accuracy, and increase reliance on expert intervention. Despite its importance, seed selection remains a challenging and underexplored problem, requiring methods that are both theoretically sound and practically efficient. Owing to its ability to effectively exploit a small amount of auxiliary information in the training data to guide the learning process, SSFC has attracted significant attention from the research community and has been widely applied in domains such as pattern recognition, information processing, and healthcare [5,7,8,9,10,11].
The difficulty of selecting good seeds arises from three main factors: (i) the non-trivial task of identifying informative patterns within uncertain cluster boundaries; (ii) the presence of noisy and imbalanced data, which can distort the clustering process; and (iii) the limited labeling budget, which necessitates efficient and adaptive query strategies.
Several studies have attempted to improve SSC through different strategies, including graph-based propagation, constraint expansion, and probabilistic seed selection [6,11]. While these methods have achieved certain successes, they still share common limitations: reliance on randomly chosen seeds or predefined constraints, lack of mechanisms to actively evaluate seed reliability, and insufficient robustness when dealing with noisy or overlapping clusters.
Fuzzy min–max neural networks (FMNNs) have recently garnered increased attention in semi-supervised learning research, thanks to their ability to make effective use of limited labeled data. FMNN operates by partitioning the data space into hyperboxes (HXs). The visualization in Figure 1 illustrates the distribution of 2D HXs belonging to seven clusters on the Aggregation dataset [12], as partitioned by FMNN. In this figure, each HX is defined over the selected feature dimensions of the dataset, ensuring that informative attributes are retained for conceptualizing the clustering model.
Based on FMNN, models such as GFMNN (General FMNN) [13], SCFMN (SSC in FMNN) [9], and MSCFMN (Modified SCFMN) [5] have demonstrated the extensibility of FMNN in SSC tasks. GFMNN maintains the HX expansion–contraction mechanism but assigns labels randomly to a portion of the training data; therefore, it requires a high proportion of labeled samples to achieve good performance, and the clustering results are often unstable. To address these shortcomings, SCFMN was introduced with a two-stage mechanism: it automatically identifies seed sets from HX partitions in the first stage and then repeatedly retrains the entire dataset to form HXs and assign labels. However, this approach leads to high computational costs and strong sensitivity to parameter settings. MSCFMN refines SCFMN with a four-phase framework: labeled samples are first separated to form reliable HXs, while the remaining data are processed through HX expansion and label inheritance based on centroid deviation and membership thresholds. This approach eliminates repeated retraining and provides greater stability. Nevertheless, MSCFMN still depends heavily on the quality of initial seeds and performs labeling in a passive manner, lacking any active strategy to identify the most informative samples for expert queries.
Although FMNN-based approaches have certain limitations, they remain a promising research direction with considerable potential and applicability [5]. In this study, we introduce an SSC model built upon FMNN and enhanced with an active seed selection strategy, referred to as FA-Seed. Rather than selecting seeds randomly, the model prioritizes data points that are expected to yield the most informative labels. This design aims to maximize information gain from each expert query while simultaneously minimizing the overall labeling effort.
The proposed model is applied to benchmark datasets to demonstrate its effectiveness in handling clustering problems characterized by fuzzy boundaries, class imbalance, and scarce labeled data. The main contributions of this work include:
  • Proposing an active seed selection strategy based on FMNN;
  • Evaluating seed quality under limited label availability;
  • Providing label suggestions to experts along with an optimized validation mechanism;
  • Assessing performance on benchmark datasets.
The remainder of this paper is organized as follows: Section 2 presents the background and related work. Section 3 describes the proposed model in detail. Section 4 reports the experimental results. Finally, Section 5 discusses the conclusions and future research directions.

2. Related Works

In [6], the research team developed SSGC (Semi-Supervised Graph-based Clustering), an active learning algorithm that identifies seed instances using the k-nearest neighbor and min–max algorithms. Its key idea is to construct a weighted k-NN graph based on shared neighbors and then partition it into connected components under a cut condition. Although its label propagation mechanism—combining an adaptive cut threshold ($\theta$) with graph connectivity—is efficient and flexible for varying cluster shapes, seed selection in SSGC remains random, making performance sensitive to initial labels.
Beyond graph-based approaches, Xin Sun et al. [11] proposed a constraint-driven framework consisting of CS-PDS (Constraints Self-learning from Partial Discriminant Spaces) and FC2 (Finding Clusters on Constraints). CS-PDS iteratively expands initial constraints using an Expectation–Maximization cycle, while FC2 performs clustering on the enriched constraint set. This approach achieves strong performance on small-scale data but relies heavily on the representativeness of initial constraints and lacks an explicit active query mechanism.
A probabilistic alternative was introduced by Bajpai et al. [10], who employed Determinantal Point Processes (DPPs) to optimize seed diversity. Unlike SSGC, DPP selects seeds that are both representative and broadly distributed, which are then used as K-Means centroids, reducing convergence to poor local optima. However, this approach incurs a high computational cost ($O(n^3)$) and struggles with noisy or overlapping clusters due to the absence of propagation or denoising mechanisms.
Another important line of research builds on FMNN, which leverages HXs to represent uncertainty and handle limited supervision. Among these approaches, MSCFMN stands out by refining HX construction and overlap handling. In MSCFMN, the number of HXs is fixed to the number of clusters; they expand under a threshold $\theta$, with contractions applied to resolve overlaps. While such refinements improve accuracy over earlier FMNN-based models, they still suffer from rigid partitioning of the feature space, the absence of active seed selection, and high sensitivity to the threshold parameter $\theta$.
Our proposed FA-Seed extends MSCFMN by retaining its HX-based partitioning mechanism while introducing adaptive seed selection, seed quality scoring, and few-shot supervision. These enhancements address the limitations of MSCFMN and enable FA-Seed to achieve superior clustering performance under limited supervision.

3. Proposed Method: FA-Seed

3.1. The Idea of Proposed Model

The effectiveness of SSC depends strongly on the quality of selected seeds. Poor seeds can propagate errors, while reliable seeds reduce labeling costs and improve clustering accuracy. However, selecting high-quality seeds is challenging in practice due to noisy data, overlapping clusters, and limited supervision. To address this, the proposed FA-Seed model extends the MSCFMN framework [5] by integrating HX partitioning, active querying, and few-shot extension.
Built on the MSCFMN, FA-Seed represents clusters as HXs, naturally handling uncertainty and overlapping regions. Unlike random or fixed initialization, the model adaptively scores candidate seeds using density, centrality, and uncertainty, and selectively queries the most informative samples. Verified seeds then guide controlled label propagation through the HX structure, ensuring consistency and reducing error propagation.
Unlike MSCFMN, which automatically assigns a single labeled point to each HX based on data distribution characteristics, FA-Seed incorporates an active learning mechanism to purposefully evaluate and select potential seed candidates. Rather than relying on default labeling, the model actively queries and verifies informative data points with experts. Additionally, FA-Seed integrates various criteria such as centroid deviation, data density, and uncertainty to filter and assess seed quality, thereby enhancing the model’s generalization ability and clustering accuracy under limited labeled data conditions.
Figure 2a visually illustrates the limitation of the MSCFMN model during Phase 1—the phase of defining seed selection—on the Aggregation dataset. The Aggregation dataset consists of 788 samples grouped into 7 clusters [12], as shown in Figure 2a.
In MSCFMN, the number of HXs generated during this initial phase is constrained to match the number of clusters. Specifically, only seven HXs are created, as seen in Figure 2b. However, several of these HXs (e.g., HXs 1, 3, 4, and 5) overlap with multiple clusters, reducing the reliability of label assignment for the corresponding seed points. If seed points from such ambiguous HXs are selected and used in label propagation, the model may suffer from reduced accuracy due to mislabeling. Furthermore, this restricted partitioning approach limits the model’s ability to capture more complex or uneven data distributions within individual clusters. MSCFMN is, thus, highly influenced by both the number and shapes of clusters.
In contrast to MSCFMN, FA-Seed addresses these limitations by enabling a more flexible partitioning of the input space, allowing the generation of a greater number of HXs beyond the initial number of clusters. This adaptive strategy improves coverage and granularity. As illustrated in Figure 2c,d:
  • Figure 2c: FA-Seed generates 17 HXs and selects 13 data points for expert query, accounting for 1.65% of the total samples.
  • Figure 2d: 26 HXs with 21 queries (2.66%).
  • Figure 2e: 38 HXs with 29 queries (3.68%).
  • Figure 2f: 52 HXs with 40 queries (5.08%).
Beyond simply expanding the number of HXs, FA-Seed incorporates a quality assessment mechanism to enhance label query efficiency. Specifically, only HXs with balanced and representative data distributions are selected for expert queries. HXs with skewed distributions or very few data points are excluded to avoid wasting labeling budget and to reduce the risk of incorrect label propagation.
Figure 3 illustrates the process of calculating and selecting HXs in the proposed FA-Seed model. HXs with a high level of data balance are selected (labeled numerically from 1 to 21), whereas imbalanced HXs are excluded from selection (denoted by characters a, b, c, d, and e).
Thanks to these two improvements—adaptive HX generation and pre-query quality evaluation—FA-Seed enables more effective learning in SSC tasks involving unlabeled and non-uniform data. It also enhances clustering accuracy while reducing the model’s dependence on prior assumptions about the number or shape of clusters.

3.2. Setup and Definitions

3.2.1. Fuzzy Hyperbox and Membership

Assume we have M data instances $X = \{x_i\}_{i=1}^{M} \subset \mathbb{R}^n$, which can be mapped into G ground-truth clusters. Let the index set $\{1, \dots, M\}$ be partitioned into a labeled subset $L$ and an unlabeled subset $U$, with $L \cup U = \{1, \dots, M\}$ and $L \cap U = \emptyset$, where the split is induced by partitioning the input space into HXs.
Let $B = \{B_j\}_{j=1}^{K}$ denote the current set of HXs (so $K = |B|$). Denote the labeled and unlabeled subsets by $B_{\mathrm{lab}}$ and $B_{\mathrm{unl}}$, respectively, with
$B_{\mathrm{lab}} \cup B_{\mathrm{unl}} = B, \qquad B_{\mathrm{lab}} \cap B_{\mathrm{unl}} = \emptyset.$ (1)
A fuzzy HX $B_j = (V_j, W_j)$ is defined as in (2):
$B_j = (V_j, W_j), \quad V_j, W_j \in \mathbb{R}^n, \quad V_j \le W_j,$ (2)
where $V_j$ and $W_j$ are the minimum and maximum vertices of the HX in the input space.
Each HX $B_j$ is associated with a class label $c_j \in \{1, \dots, C\}$, which is assigned during the learning process. For $i \in L$, labels $y_i \in \{1, \dots, C\}$ are known; for $i \in U$, predicted labels are denoted by $\hat{y}_i$.
Let x be a candidate sample. Its membership in $B_j$ is $\mu_j(x, B_j) \in [0, 1]$, as defined in Equation (3). Given the current set of HXs $B = \{B_j\}_{j=1}^{K}$, the HX with the largest membership value for x is determined by Equation (5).
$\mu_j(x, B_j) = \frac{1}{n} \sum_{r=1}^{n} \big[ 1 - f(x_r - w_{jr}, \gamma) - f(v_{jr} - x_r, \gamma) \big],$ (3)
where $f(\cdot, \cdot)$ is a two-parameter function used to modulate the attenuation of the membership value $\mu_j$, based on the relation between the data sample x and the boundary of the HX $B_j$. The sensitivity parameter $\gamma$ directly affects the shape of this attenuation: a larger $\gamma$ leads to faster decay of the membership value, thereby making the model more responsive to boundary violations. Its mathematical formulation is given in Equation (4):
$f(x, y) = \begin{cases} 0, & \text{if } xy < 0, \\ xy, & \text{if } 0 \le xy \le 1, \\ 1, & \text{otherwise}. \end{cases}$ (4)
$j^{*} = \arg\max_{j \in \{1, \dots, K\}} \mu_j(x, B_j).$ (5)
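To make Equations (3)–(5) concrete, the following sketch implements the membership computation in Python. This is an illustrative helper of our own, not the authors' implementation; names such as `f_ramp` and `best_hyperbox` are our assumptions.

```python
import numpy as np

def f_ramp(x, gamma):
    """Two-parameter ramp function of Eq. (4): 0 for points on the
    interior side, linear attenuation near the boundary, saturating at 1."""
    return np.clip(x * gamma, 0.0, 1.0)

def membership(x, V_j, W_j, gamma=10.0):
    """Fuzzy membership of sample x in hyperbox B_j = (V_j, W_j), Eq. (3)."""
    over = f_ramp(x - W_j, gamma)    # penalty for exceeding the max vertex W_j
    under = f_ramp(V_j - x, gamma)   # penalty for falling below the min vertex V_j
    return float(np.mean(1.0 - over - under))

def best_hyperbox(x, boxes, gamma=10.0):
    """Index j* of the hyperbox with the largest membership for x, Eq. (5)."""
    return max(range(len(boxes)),
               key=lambda j: membership(x, boxes[j][0], boxes[j][1], gamma))
```

A sample lying inside a box incurs no boundary penalty and receives membership 1; points outside are attenuated per dimension at a rate set by $\gamma$.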

3.2.2. Seed Set Based on Centers and Purity

Let $T_j$ be the core set used to compute the data center $d_j$ as in (6) (the $\tau$-core $T_j = \{ i : \mu_j(x_i, B_j) \ge \tau \}$). The geometric center $g_j$ is defined in (7), and the center deviation $e_j$ is defined in (8). An HX is regarded as pure if $e_j$ satisfies the constraint in (9).
$d_j = \frac{1}{|T_j|} \sum_{i \in T_j} x_i.$ (6)
$g_j = \frac{1}{2}(V_j + W_j).$ (7)
$e_j = \| g_j - d_j \|_2.$ (8)
$e_j \le \delta, \quad \delta > 0 \text{ is user-defined}.$ (9)
Consequently, $J_Q$ is the index set of pure fuzzy HXs, given by (10); the candidate query set Q is the collection of their data centers, as in (11); and S denotes the set of fully contained samples, defined in (12) and generated from Q.
$J_Q = \{ j : B_j \in B,\ e_j \le \delta \}.$ (10)
$Q = \{ d_j \mid j \in J_Q \}.$ (11)
$S = \bigcup_{j \in J_Q} \{ x_i \in X \mid \mu_j(x_i, B_j) \ge \tau \}, \quad \delta > 0,\ \tau \in [0, 1].$ (12)
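As an illustrative sketch (simplified code of our own, not the published implementation), Equations (6)–(12) can be realized as a single filtering pass over the hyperboxes; `assignments[j]` is assumed to list the indices of samples assigned to box j.

```python
import numpy as np

def _membership(x, V, W, gamma=10.0):
    """Eq. (3) membership, with the ramp of Eq. (4) applied per dimension."""
    over = np.clip((x - W) * gamma, 0.0, 1.0)
    under = np.clip((V - x) * gamma, 0.0, 1.0)
    return float(np.mean(1.0 - over - under))

def pure_hyperbox_seeds(X, boxes, assignments, tau=0.8, delta=0.05, gamma=10.0):
    """Filter hyperboxes by centre deviation (Eqs. (6)-(9)) and collect the
    candidate query set Q (Eq. (11)) plus the seed pool S (Eq. (12))."""
    Q, S = [], set()
    for j, (V, W) in enumerate(boxes):
        # tau-core: samples with membership >= tau, Eq. (6) preamble
        core = [i for i in assignments[j]
                if _membership(X[i], V, W, gamma) >= tau]
        if not core:
            continue
        d_j = X[core].mean(axis=0)          # data centre, Eq. (6)
        g_j = 0.5 * (V + W)                 # geometric centre, Eq. (7)
        e_j = np.linalg.norm(g_j - d_j)     # centre deviation, Eq. (8)
        if e_j <= delta:                    # purity gate, Eq. (9)
            Q.append(d_j)                   # centre proposed for expert query
            S.update(core)                  # fully contained seed candidates
    return Q, sorted(S)
```

Boxes whose data mass sits off-centre (large $e_j$) are simply skipped, which is how skewed or sparsely populated HXs are excluded from the labeling budget.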

3.2.3. Adaptive Partition and Assignment with Fuzzy HXs

Given the current HXs $B = \{B_j\}_{j=1}^{K}$ and a candidate sample x, first select the HX with the largest membership (as defined in Equation (5)). Then update the set B according to the following rules:
  • Containment rule: if $\exists j$ such that $\mu_j(x, B_j) = 1$, assign x to that $B_j$ (uniqueness follows from $\Omega(B_a, B_b) \le 0$).
  • Expansion: form the tentative expansion $B_j'$ of $B_{j^*}$, as defined in Equation (13). Accept it and set $B_{j^*} \leftarrow B_j'$ if both of the following hold:
    (1) the size constraint $\theta(x, B_j) \le \theta_{\max}$ (as defined in Equation (14));
    (2) after any necessary contraction, the overlap constraint $\Omega(B_i, B_j)$ satisfies Equation (16).
    If accepted, assign x to the updated $B_{j^*}$.
  • Create a new HX: if the expansion is not accepted, initialize a new HX around x, as in Equation (17).
$B_j' = \big( \min(V_j, x),\ \max(W_j, x) \big).$ (13)
$\theta(x, B_j) \le \theta_{\max}, \quad \theta_{\max} \text{ is user-defined},$ (14)
where $\theta(x, B_j)$ denotes the post-expansion size of $B_j$ (e.g., as in (15)):
$\theta(x, B_j) = \frac{1}{n} \sum_{r=1}^{n} \big| \max(x_r, W_{jr}) - \min(x_r, V_{jr}) \big|.$ (15)
$\Omega(B_i, B_j) \le 0, \quad \forall i \ne j,$ (16)
where $\Omega(\cdot, \cdot)$ is the overlap measure.
$B_{\mathrm{new}} = (x, x).$ (17)
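The containment/expansion/creation rules above can be sketched as one online update step. This is a simplification of our own: the overlap test and contraction of Equation (16) are omitted for brevity, and `place_sample` is a hypothetical name.

```python
import numpy as np

def place_sample(x, boxes, theta_max=0.3, gamma=10.0):
    """One online update (Eqs. (13)-(17)): try containment, then expansion of
    the best-matching box under the size constraint, else spawn a new box.
    Returns the index of the box that now holds x; `boxes` is mutated."""
    def mem(V, W):
        over = np.clip((x - W) * gamma, 0.0, 1.0)
        under = np.clip((V - x) * gamma, 0.0, 1.0)
        return float(np.mean(1.0 - over - under))

    if boxes:
        j = max(range(len(boxes)), key=lambda k: mem(*boxes[k]))
        V, W = boxes[j]
        if mem(V, W) == 1.0:                            # containment rule
            return j
        V2, W2 = np.minimum(V, x), np.maximum(W, x)     # tentative expansion, Eq. (13)
        size = float(np.mean(W2 - V2))                  # post-expansion size, Eq. (15)
        if size <= theta_max:                           # size constraint, Eq. (14)
            boxes[j] = (V2, W2)
            return j
    boxes.append((x.copy(), x.copy()))                  # new point hyperbox, Eq. (17)
    return len(boxes) - 1
```

A sample far from all existing boxes fails the size constraint and seeds a fresh degenerate box $(x, x)$, which later expansions can grow.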

3.2.4. Fuzzy HX-Level Labeling by Inheritance

With a decreasing schedule $\{\beta_t\}_{t=0}^{T}$ ($\beta_0 > \beta_1 > \dots > \beta_T \ge \beta_{\min}$), an unlabeled HX $B_j$ inherits the label of the nearest labeled HX if both gates hold: (i) the center-alignment gate $e_j \le \delta$ (with $e_j$ defined in Equation (8)) and (ii) the membership gate
$\mu_{j^*}(d_j, B_{j^*}) \ge \beta_t, \quad \beta_t \in [0, 1],$ (18)
where $d_j$ is the data center defined in Equation (6) and $j^*$ is chosen by Equation (19); labels are then updated according to Equation (20).
$j^* = \arg\max_{j' : B_{j'} \in B_{\mathrm{lab}}} \mu_{j'}(d_j, B_{j'}).$ (19)
When both gates are satisfied, update the label and the sets:
$e_j \le \delta \ \wedge\ \mu_{j^*}(d_j, B_{j^*}) \ge \beta_t \ \Rightarrow\ \mathrm{label}(B_j) \leftarrow \mathrm{label}(B_{j^*}), \quad B_{\mathrm{lab}} \leftarrow B_{\mathrm{lab}} \cup \{B_j\}, \quad B_{\mathrm{unl}} \leftarrow B_{\mathrm{unl}} \setminus \{B_j\}.$ (20)
Iterate (20) for $t = 0, \dots, T$. Finally, submit the remaining unlabeled (outlier/atypical) HXs for expert labeling:
$Q' = B_{\mathrm{unl}}.$ (21)
Edge cases: If $B_{\mathrm{lab}} = \emptyset$, skip or defer inheritance, or query an expert. If no HX exists yet ($K = 0$), initialize $B_1$ as in Equation (17). If the current sample x has a ground-truth label, set $c_1 = y_i$; otherwise, leave it pending.
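The gated inheritance loop of Equations (18)–(21) can be sketched as follows. This is an illustrative simplification under our own naming: `delta_flags[j]` stands in for a precomputed center-alignment gate $e_j \le \delta$, and the schedule is realized as $\beta_t = \beta_0 \phi^t$.

```python
import numpy as np

def inherit_labels(boxes, labels, centers, delta_flags,
                   beta0=0.99, phi=0.9, beta_min=0.5, gamma=10.0):
    """Hyperbox-level label inheritance under a decreasing schedule.
    `labels[j]` is None for unlabeled boxes. Returns the updated labels and
    the indices still queued for expert query (Eq. (21))."""
    def mem(x, V, W):
        over = np.clip((x - W) * gamma, 0.0, 1.0)
        under = np.clip((V - x) * gamma, 0.0, 1.0)
        return float(np.mean(1.0 - over - under))

    beta = beta0
    while beta >= beta_min:
        for j, lab in enumerate(labels):
            if lab is not None or not delta_flags[j]:
                continue
            labelled = [k for k, l in enumerate(labels) if l is not None]
            if not labelled:
                break
            # nearest labelled box by membership of the data centre, Eq. (19)
            k = max(labelled, key=lambda kk: mem(centers[j], *boxes[kk]))
            if mem(centers[j], *boxes[k]) >= beta:   # membership gate, Eq. (18)
                labels[j] = labels[k]                # inheritance, Eq. (20)
        beta *= phi                                  # anneal the threshold
    # remaining unlabelled boxes go to the expert, Eq. (21)
    return labels, [j for j, l in enumerate(labels) if l is None]
```

Early iterations only propagate to boxes that match a labeled neighbor almost perfectly; as $\beta_t$ decays, progressively less certain boxes inherit, and whatever survives all rounds is escalated to an expert.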

3.2.5. Training Procedure

Stage 1 (One-Shot): query all seeds $Q^{(1)} = S$ (with S defined in Equation (12)) to obtain $(B_{\mathrm{lab},0}, B_{\mathrm{unl},0})$.
Stage 2 (Progressive): with a decreasing schedule $\{\beta_t\}$, for each $u \in B_{\mathrm{unl},t}$, let $j^*(u)$ be selected according to Equation (19).
If the alignment gate $e_u \le \delta$ (Equation (9)) and the membership gate $\mu_{j^*(u)}(d_u, B_{j^*(u)}) \ge \beta_t$ (Equation (18)) both hold, then move u to $B_{\mathrm{lab},t+1}$; otherwise, keep it in $B_{\mathrm{unl},t+1}$. Finally, query the remainder
$Q^{(2)} = B_{\mathrm{unl},T+1}.$ (22)

3.3. Model Architecture and Learning Process

3.3.1. Architecture of FA-Seed Model

The FA-Seed model combines the architecture of the FMNN [14,15] with a semi-supervised clustering strategy, organized into two main phases for clustering on the dataset X. In the first phase, FA-Seed identifies seeds (labeled samples) for a subset of the data, denoted as S, while the remaining unlabeled samples form the subset U ( X = S U ). In the second phase, FA-Seed propagates labels from S to U by constructing and adapting HXs, thereby forming clusters for the entire dataset X.
As illustrated in Figure 4, the architecture of FA-Seed consists of two main components:
  • Seed scoring module: definition of seeds based on HX partitions, which evaluates candidate seeds using criteria such as data density, centroid deviation, and uncertainty, and assigns labels to reliable seeds within their respective HXs.
  • Controlled label propagation and extended evaluation: This propagates labels from the verified seed set to the unlabeled data, ensuring consistency and accuracy across clusters. This component leverages the HX structure to minimize error propagation and actively queries uncertain samples that cannot be confidently assigned to a cluster.
Figure 4. The architecture of the proposed FA-Seed model with an integrated label evaluation component.

3.3.2. Neural Network Architecture of FA-Seed

The FA-Seed model is structured as a four-layer neural network, as illustrated in Figure 5. We consider four layers $F_A \to F_B \to F_H \to F_C$, each playing a distinct role in the semi-supervised clustering process:
  • Input layer $F_A$: This layer contains n nodes, each representing an input attribute. A sample $x_i = (x_{i1}, x_{i2}, \dots, x_{in})$ is mapped into HXs in $F_B$ using the min–max boundary vectors v and w.
  • HX scoring layer $F_B$: This layer includes K nodes (HXs) produced in Stage 1, each corresponding to an HX $B_j = (v_j, w_j) \in B$. These HXs are constructed incrementally and capture the local data structure in the input space. The activation is the membership
    $b_j(x_i) = \mu_j(x_i, B_j) \in [0, 1].$ (23)
  • Evaluation/propagation layer $F_H$: HX nodes are partitioned into labeled and unlabeled sets, $B_{\mathrm{lab}}$ (of size $K_{\mathrm{lab}}$) and $B_{\mathrm{unl}}$ (of size $K_{\mathrm{unl}}$). This layer applies the gates and the inheritance rule to move HXs from $B_{\mathrm{unl}}$ to $B_{\mathrm{lab}}$ (at termination, $B_{\mathrm{unl}} = \emptyset$, so $K_{\mathrm{unl}} = 0$ and $K_{\mathrm{lab}} = K$).
  • Output layer $F_C$: This layer contains G nodes, each representing one output cluster. The binary relationship between HX $L_j \in F_H$ and class $c_g \in F_C$ is represented by the matrix $U = \{u_{jg}\}$:
    $u_{jg} = \begin{cases} 1, & \text{if } l_j \in c_g, \\ 0, & \text{otherwise}. \end{cases}$ (24)
    The membership score of a sample in class $c_g$ is computed as
    $c_g = \max_{1 \le j \le q} l_j \cdot u_{jg},$ (25)
    where $l_j$ is the activation of HX j. The set of HXs defining class g is given by
    $C_g = \bigcup_{j \in G_g} L_j,$ (26)
    with $G_g$ denoting the index set of HXs assigned to class g.
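For illustration, the $F_C$ computation of Equations (24)–(25) reduces to a masked maximum over hyperbox activations. The sketch below uses our own names and assumes the binary matrix U has already been built.

```python
import numpy as np

def class_scores(activations, u):
    """Output-layer scores (Eqs. (24)-(25)): u[j][g] = 1 when hyperbox j
    belongs to class g; the score of class g is the maximum activation
    over the hyperboxes assigned to it."""
    l = np.asarray(activations, dtype=float)  # l_j: memberships from F_B
    U = np.asarray(u, dtype=float)            # binary hyperbox-to-class matrix
    return (l[:, None] * U).max(axis=0)       # c_g = max_j l_j * u_{jg}
```

The winning class for a sample is then simply `argmax` over the returned vector.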

3.3.3. Learning Algorithm

The first stage of the learning algorithm defines seeds based on the partition of the data space into fuzzy HXs. Candidate seeds are evaluated using data density, centroid deviation, and uncertainty (Equations (6)–(9)), and reliable seeds are then assigned labels within their respective HXs (Equation (12)). The learning algorithm for this phase is presented in Algorithm 1.
Algorithm 1: Seed Set Construction via Hyperbox Filtering.
Input: Dataset X ($M = |X|$); expansion threshold $\theta_{\max}$; deviation threshold $\delta$
Output: Updated dataset X; B
  • Step 1. Hyperbox generation
       Perform one pass over X to construct the fuzzy HX set $B = \{B_1, \dots, B_K\}$ ($K = |B|$) using min–max boundaries with $\theta_{\max}$.
  • Step 2. Hyperbox filtering (identify data instances to query)
       For each HX $B_j \in B$, denote by $X_j \subseteq X$ the set of samples assigned to $B_j$, and by $V_j, W_j \in \mathbb{R}^n$ its min/max vertices.
       Compute the data center defined by Equation (6) and the geometric center of the HX $B_j$ defined by Equation (7).
       Keep only those $B_j$ whose data center is sufficiently close to the geometric center, and remove $B_j$ if condition (9) is violated: $B := B \setminus \{B_j\}$.
  • Step 3. Seed set construction, partition, and reordering
       Build the seed set S from samples fully contained in at least one surviving HX, by Equation (12).
       Build a permutation $\pi$ of $\{1, \dots, M\}$ that places S first and U after (U being the set of unlabeled samples, i.e., points that do not fully belong to any surviving HX), and form the reordered dataset $X = (X_{\pi(1)}, \dots, X_{\pi(M)})$.
return B (the surviving HXs), X (the reordered dataset)
Computational Complexity of FA-Seed—Algorithm 1:
$T_1 = O(nMK).$ (27)
The second stage of the learning algorithm describes the propagation of labels from the verified seed set to the unlabeled data (Equations (18)–(20)), ensuring consistency and accuracy across clusters. By leveraging the HX structure, this stage minimizes error propagation and actively queries uncertain samples that cannot be confidently assigned to a cluster. The learning algorithm for this stage is presented in Algorithm 2.
$\mu_{j^*}(x, L_{j^*}) = \max\{ \mu_j(x, L_j) : j = 1, \dots, K \},$ (28)
where $\theta(x, L_j)$ denotes the expansion measure when adding x to $L_j$.
$\max_{j \in \{1, \dots, |L|\}} \mu(H_{\mathrm{new}}, L_j) \ge \beta.$ (29)
$U_l^{\mathrm{label}} = L_{j^*}^{\mathrm{label}}, \quad j^* = \arg\max_{j \in \{1, \dots, |L|\}} \mu(g_l, L_j).$ (30)
$L := L \cup \{U_j\}, \qquad U := U \setminus \{U_j\}.$ (31)
Algorithm 2: Guided Hyperbox Expansion and Label Propagation.
Input: Reordered dataset X; S; $\beta$; $\delta$; $\theta_{\max}$.
Output: L.
(The pseudocode of Algorithm 2 is presented as a figure in the original publication.)
Computational Complexity of FA-Seed—Algorithm 2:
$T_2 = O(nMK).$ (32)
Computational complexity of the FA-Seed algorithm:
$T = T_1 + T_2 = O(nMK).$ (33)

4. Experiments

4.1. Experimental Objectives

The goal of this section is to evaluate the effectiveness of the proposed FA-Seed model in semi-supervised clustering tasks. Specifically, we aim to investigate: (i) the ability of FA-Seed to improve clustering accuracy under limited labeled data, (ii) its efficiency in seed selection compared to baseline methods, and (iii) its applicability to benchmark datasets [12,16].
To evaluate the performance of FA-Seed, experiments were conducted to compare it against several state-of-the-art semi-supervised clustering baselines, including SSDBSCAN, SSK-Means, K-Means, SSGC [6], MSCFMN [5], and FC2 [11].

4.2. Experimental Setup

4.2.1. Datasets

We conduct experiments on benchmark datasets to evaluate the general performance and applicability of the proposed model. The benchmark datasets are drawn from standard UCI and CS collections, including Iris, Wine, Zoo, Soybean, Ecoli, Yeast, PID, Spiral, Pathbased, Thyroid, and others. These datasets span a wide range of sample sizes, dimensions, and cluster structures, thereby providing a comprehensive basis for evaluating clustering quality under diverse conditions. The detailed characteristics are summarized in Table 1.
To ensure that all data features are referenced on a common scale, thereby guaranteeing a balanced contribution of each attribute to the clustering process and avoiding hyperbox adjustment being biased toward dimensions with larger values, all datasets are normalized to the interval [0,1] using min–max normalization. This normalization is not only a technical requirement for the FMNN-based membership function [14,15], but also directly affects the accuracy and robustness of the model. Specifically, normalization improves the precision in forming hyperbox structures and enhances robustness by reducing sensitivity to noise and scale variations among features. For datasets with missing values, the treatment follows the approach of Batista [17], in which missing entries are replaced with the mean value of the corresponding attribute, as defined in Equation (34).
$A_{hj} = \frac{1}{m} \sum_{i=1}^{m} A_{ij},$ (34)
where $A_{hj}$ is the value imputed for a missing entry of attribute j, computed as the mean of the observed values $A_{ij}$ of that attribute.
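The preprocessing pipeline described above (mean imputation per Equation (34), followed by min–max normalization to [0, 1]) can be sketched as follows; `preprocess` is an illustrative helper of our own, not part of the published code.

```python
import numpy as np

def preprocess(X):
    """Mean-impute missing values (Eq. (34)), then min-max normalise
    every attribute to [0, 1]."""
    X = np.array(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)        # attribute means, ignoring NaNs
    nan_r, nan_c = np.where(np.isnan(X))
    X[nan_r, nan_c] = col_mean[nan_c]       # replace each missing entry
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (X - lo) / span
```

Normalizing after imputation keeps every attribute on the same scale, so no single dimension dominates the hyperbox expansion test.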

4.2.2. Parameter Settings

The following settings are used unless otherwise specified: fuzziness parameter $\gamma = 10$, initial threshold $\beta = 0.99$, threshold decay factor $\phi = 0.9$, and the expansion and deviation thresholds $\theta_{\max}$ and $\delta$, which were selected empirically through an iterative error-correction process. Specifically, suboptimal clustering results in early trials guided successive adjustments until stable and competitive performance was achieved across datasets.

4.2.3. Evaluation Metrics

As noted earlier, clustering performance depends on both the quality and the number of selected seeds. We therefore conduct comparative experiments on benchmark datasets, comparing against state-of-the-art semi-supervised clustering baselines and evaluating performance using Rand Index (RI), accuracy, training time, number of queries (NoQ), and number of seeds (NoS) obtained via expert querying. RI is computed by (35), and accuracy by (36).
$RI(X, Y) = \frac{2(u + v)}{n(n - 1)}$ (35)
$\mathrm{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} H(y_i = \bar{y}_i)$ (36)
where:
  • u is the number of correct decisions for which $y_i = y_j$ in both the clustering result and the ground truth.
  • v is the number of correct decisions for which $y_i \ne y_j$ in both the clustering result and the ground truth.
  • n is the total number of samples.
  • $H(y) = 1$ if the prediction is correct; otherwise, $H(y) = 0$.
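As a reference sketch, Equations (35)–(36) translate directly into pair counting and a mean of indicator values. The accuracy helper assumes the predicted cluster labels have already been aligned to the ground-truth labels.

```python
import numpy as np
from itertools import combinations

def rand_index(y_true, y_pred):
    """Rand Index, Eq. (35): u counts agreeing same-cluster pairs,
    v counts agreeing different-cluster pairs."""
    n = len(y_true)
    u = v = 0
    for i, j in combinations(range(n), 2):
        same_true = y_true[i] == y_true[j]
        same_pred = y_pred[i] == y_pred[j]
        u += same_true and same_pred
        v += (not same_true) and (not same_pred)
    return 2 * (u + v) / (n * (n - 1))

def accuracy(y_true, y_pred):
    """Eq. (36): fraction of samples whose (aligned) predicted label matches."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

The pairwise loop is $O(n^2)$, which suffices for the benchmark sizes used here; contingency-table formulations are preferable for very large n.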

4.3. Results

Before comparing FA-Seed with other semi-supervised clustering methods, we first examine its stability across repeated experiments. Figure 6 shows the mean accuracy and standard deviations over 20 runs on benchmark datasets. The results indicate that FA-Seed achieves consistently high accuracy (mostly above 0.9) with relatively small deviations, demonstrating robust and reliable performance across datasets. Larger deviations appear in complex datasets such as Thyroid and Yeast, which are inherently noisy and more challenging; nevertheless, the overall low variance confirms the stability of FA-Seed.
As analyzed, increasing the proportion of seeds often contributes to improving clustering quality. In the next experiment, we focus on evaluating the ability to obtain seeds through expert queries.
Figure 7 visually illustrates the comparison of the number of seeds obtained between the methods when using the same number of queries. The results show that, at the same NoQ level, FA-Seed consistently obtains a higher number of seeds compared to MSCFMN [5]. Notably, FA-Seed achieves a high seed ratio even when the number of queries is low. This is a significant advantage, as it greatly reduces manual labeling costs. In addition, a high seed ratio directly contributes to enhancing the model’s performance. The main reason FA-Seed achieves a larger number of seeds is its finer HX partitioning mechanism compared to MSCFMN, which allows the model to approach cluster boundaries more closely and, thus, more effectively exploit potential data points for labeling.
Figure 8 presents the comparison results of FA-Seed with SSGC, SSDBSCAN, SSK-Means, K-Means, and MSCFMN [5,6], using the Rand Index (RI) to evaluate clustering quality. The seeds are randomly selected in each run, and the process is repeated 20 times, with the average RI value recorded for comparison. From the visual results, it can be observed that FA-Seed generally performs better or at least comparable to the other methods on most datasets. For datasets with a clear cluster structure, such as Soybean, Zoo, and Iris, FA-Seed achieves near-perfect RI values and greater stability compared to other methods. This demonstrates its ability to effectively leverage seed information to assign labels accurately from the very beginning. For datasets with a large number of clusters and evenly distributed data, such as Ecoli, FA-Seed still maintains high and stable RI values when the number of seeds changes. For datasets with a strong imbalance in cluster sizes, such as Yeast (ranging from 5 to over 400 objects per cluster), or with high overlap levels, such as Thyroid, FA-Seed achieves better results thanks to its optimal seed selection mechanism from “trusted HXs” instead of random selection, thereby reducing the negative impact of small or overlapping clusters.
Figure 9 compares the accuracy of FA-Seed and MSCFMN. The results indicate that FA-Seed achieves accuracy comparable to or higher than that of MSCFMN on most datasets, with notable improvements on datasets with imbalanced class distributions (e.g., Thyroid, Pathbased, Zoo). This demonstrates that FA-Seed’s high-quality seed selection mechanism can enhance clustering performance even under challenging data conditions.
Figure 10 compares the RI values of FA-Seed and FC2 [11]. The results show that FA-Seed consistently achieves higher RI values across all proportions of labeled data. Notably, the RI gap between FA-Seed and FC2 is largest at low labeling rates, confirming the advantage of the seed selection strategy in semi-supervised settings with limited labeled data.
Figure 11 presents a runtime comparison between FA-Seed and the baseline algorithms on the benchmark datasets. As shown, FA-Seed is consistently the fastest, or among the fastest. In terms of time complexity, FA-Seed has a computational cost of O(nMK) (Equation (33)), where M = |X| denotes the number of data samples, n is the number of attributes (data dimensions), and K = |B| is the number of HXs. In practice, K is often significantly smaller than M, allowing FA-Seed to scale efficiently as the dataset size grows.
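The O(nMK) cost reflects evaluating each of the M samples against each of the K hyperboxes across all n dimensions. A small NumPy sketch with a simplified membership function (the exact FMNN membership function used by FA-Seed may differ) makes the shape of this computation explicit:

```python
import numpy as np

def membership(X, V, W):
    """Simplified hyperbox membership of M samples (rows of X, shape M x n)
    against K hyperboxes with min vertices V and max vertices W (each K x n).
    The broadcast produces an (M, K, n) array, i.e., O(n*M*K) work overall."""
    below = np.maximum(0.0, V[None, :, :] - X[:, None, :])  # violation below min
    above = np.maximum(0.0, X[:, None, :] - W[None, :, :])  # violation above max
    # Per-dimension fit in [0, 1], averaged over dimensions -> shape (M, K).
    return np.clip(1.0 - below - above, 0.0, 1.0).mean(axis=2)
```

A point inside a box attains membership 1, and membership decays linearly with its distance outside the box in each dimension; since K is typically much smaller than M, the matrix stays narrow as the dataset grows.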
In contrast, SSGC, the main algorithm presented in [6], has a complexity of O(M^2) in the number of samples, due to the need to construct a k-nearest-neighbor graph and perform label propagation, which becomes costly for large datasets. SSDBSCAN achieves an average complexity of O(M log M) when using optimal data structures, but still reaches O(M^2) in the worst case. For SSK-Means and semi-supervised K-Means, the complexity is O(t × n × c × M), where t is the number of iterations and c is the number of clusters; the computational cost increases rapidly when either t or c is large.
Overall, FA-Seed demonstrates a clear advantage by partitioning the dataset into HXs and identifying “reliable HXs” to select high-quality seeds, while obtaining a larger number of seeds in a single independent stage. This approach allows users to proactively determine the number of labels based on the number of reliable HXs generated, thereby significantly reducing the labeling effort while maintaining, or even improving, clustering quality.
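To illustrate the reliable-HX idea in schematic form (this is not FA-Seed’s actual scoring, which combines membership, association density, and label purity; the density score below is a hypothetical stand-in), a ranking-based selection could look like:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Hyperbox:
    v: List[float]                      # min vertex
    w: List[float]                      # max vertex
    members: List[int] = field(default_factory=list)  # indices of contained samples

def density_score(box: Hyperbox, eps: float = 1e-9) -> float:
    """Points per unit volume; eps guards against degenerate (point) boxes."""
    volume = 1.0
    for lo, hi in zip(box.v, box.w):
        volume *= max(hi - lo, eps)
    return len(box.members) / volume

def select_seed_boxes(boxes: List[Hyperbox], budget: int) -> List[Hyperbox]:
    """Pick the `budget` densest hyperboxes as candidates for expert queries,
    so each query labels a representative of a compact, well-populated region."""
    return sorted(boxes, key=density_score, reverse=True)[:budget]
```

Under this scheme the user controls the labeling effort directly through the budget, mirroring how FA-Seed lets the number of reliable HXs determine the number of labels requested.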

5. Conclusions

In this paper, we propose FA-Seed, a flexible and active learning-based seed selection model for semi-supervised clustering. By integrating adaptive hyperbox-based data partitioning with an active querying mechanism, FA-Seed effectively identifies high-quality seeds, reduces labeling costs, and improves clustering accuracy. Experimental results on standard benchmark datasets demonstrate that FA-Seed consistently outperforms baseline approaches such as SSGC, MSCFMN, and FC2 in terms of clustering quality, seed efficiency, and computational cost. These findings confirm the potential of FA-Seed for practical applications, particularly in medical diagnosis, where data distributions are complex and labeled data are scarce.
Despite its advantages, FA-Seed still has certain limitations. Its performance is sensitive to parameter settings (e.g., θ ) and remains dependent on expert feedback when clusters are highly overlapping or noisy. In the future, we plan to enhance the robustness of FA-Seed against noisy labels and uncertain expert feedback. Combining FA-Seed with deep representation learning may further improve its performance on complex data, while explainable AI techniques could enhance interpretability in sensitive domains such as healthcare. In addition, the seed selection strategy of FA-Seed is not restricted to clustering; it can be naturally extended to supervised classification learning and integrated into other network architectures such as CNNs, highlighting its universality as a general mechanism for sample selection. Finally, large-scale validation on multi-domain and multi-center datasets will be essential to confirm the practical applicability of FA-Seed in diagnostic systems and beyond.

Author Contributions

D.M.V. and T.S.N. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Hanoi University of Industry under Grant 21-2023-RD/HD-DHCN.

Data Availability Statement

The data presented in this study are openly available in the UC Irvine Machine Learning Repository at https://archive.ics.uci.edu (accessed on 25 August 2025, reference number [16]), and in the Clustering Datasets at https://cs.joensuu.fi/sipu/datasets/ (accessed on 25 August 2025, reference number [12]).

Acknowledgments

We would like to express our sincere gratitude to Hanoi University of Industry for providing the support and favorable conditions that enabled us to complete this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CS-PDS	Constraints Self-learning from Partial Discriminant Spaces
DPP	Determinantal Point Processes
EM	Expectation Maximization
FC2	Finding Clusters on Constraints
HX	Hyperbox
HXs	Hyperboxes
FMNN	Fuzzy Min–Max Neural Network
GFMNN	General FMNN
MSCFMN	Modified SCFMN
SCFMN	Semi-supervised Clustering in FMNN
SSGC	Semi-Supervised Graph-based Clustering
SSC	Semi-supervised Clustering
SSFC	Semi-supervised Fuzzy Clustering

References

  1. Li, F.; Yue, P.; Su, L. Research on the convergence of fuzzy genetic algorithm based on rough classification. In Advances in Natural Computation, Proceedings of the Second International Conference, ICNC 2006, Xi’an, China, 24–28 September 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 792–795. [Google Scholar]
  2. Pedrycz, W.; Waletzky, J. Fuzzy clustering with partial supervision. IEEE Trans. Syst. Man Cybern. Part B 1997, 27, 787–795. [Google Scholar] [CrossRef] [PubMed]
  3. Gabrys, B.; Bargiela, A. General fuzzy min-max neural network for clustering and classification. IEEE Trans. Neural Netw. 2000, 11, 769–783. [Google Scholar] [CrossRef] [PubMed]
  4. Endo, Y.; Hamasuna, Y.; Yamashiro, M.; Miyamoto, S. On semi-supervised fuzzy c-means clustering. In Proceedings of the 2009 IEEE International Conference on Fuzzy Systems, Jeju, Republic of Korea, 20–24 August 2009; pp. 1119–1124. [Google Scholar]
  5. Minh, V.D.; Ngan, T.T.; Tuan, T.M.; Duong, V.T.; Cuong, N.T. An improvement in integrating clustering method and neural network to extract rules and application in diagnosis support. Iran. J. Fuzzy Syst. 2022, 19, 147–165. [Google Scholar]
  6. Vu, V.V. An efficient semi-supervised graph based clustering. Intell. Data Anal. 2018, 22, 297–307. [Google Scholar] [CrossRef]
  7. Almazroi, A.A.; Atwa, W. Semi-Supervised Clustering Algorithms Through Active Constraints. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 7. [Google Scholar] [CrossRef]
  8. Shen, Z.; Lai, M.J.; Li, S. Graph-based semi-supervised local clustering with few labeled nodes. arXiv 2022, arXiv:2211.11114. [Google Scholar]
  9. Tran, T.N.; Vu, D.M.; Tran, M.T.; Le, B.D. The combination of fuzzy min–max neural network and semi-supervised learning in solving liver disease diagnosis support problem. Arab. J. Sci. Eng. 2019, 44, 2933–2944. [Google Scholar] [CrossRef]
  10. Bajpai, N.; Paik, J.H.; Sarkar, S. Balanced seed selection for K-means clustering with determinantal point process. Pattern Recognit. 2025, 164, 111548. [Google Scholar] [CrossRef]
  11. Sun, X. Semi-Supervised Clustering via Constraints Self-Learning. Mathematics 2025, 13, 1535. [Google Scholar] [CrossRef]
  12. Kämäräinen, J.K.; Kauppi, O.P.; Fränti, P. Clustering Datasets. Available online: https://cs.joensuu.fi/sipu/datasets/ (accessed on 2 August 2025).
  13. Khuat, T.T.; Gabrys, B. A comparative study of general fuzzy min-max neural networks for pattern classification problems. Neurocomputing 2020, 386, 110–125. [Google Scholar] [CrossRef]
  14. Simpson, P.K. Fuzzy min-max neural networks. Part 2: Clustering. IEEE Trans. Fuzzy Syst. 1993, 1, 32–45. [Google Scholar] [CrossRef]
  15. Simpson, P.K. Fuzzy min-max neural networks. I. Classification. IEEE Trans. Neural Netw. 1992, 3, 776–786. [Google Scholar] [CrossRef]
  16. Aha, D. UC Irvine Machine Learning Repository. Available online: https://archive.ics.uci.edu/ (accessed on 2 August 2025).
  17. Batista, G.E.; Monard, M.C. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell. 2003, 17, 519–533. [Google Scholar] [CrossRef]
Figure 1. Illustration of fuzzy HXs in a two-dimensional Aggregation dataset, partitioned by the FMNN model. Each color represents a distinct cluster, and different shapes indicate data points belonging to corresponding fuzzy HXs.
Figure 2. Illustration of HX partitioning on the Aggregation dataset by FMNN-based models, including (a) data distribution of the Aggregation dataset; (b) data space partitioned into 7 HXs by MSCFMN (one per cluster); (c) data space partitioned into 17 HXs by FA-Seed with 13 queried points; (d) data space partitioned into 26 HXs by FA-Seed with 21 queried points; (e) data space partitioned into 38 HXs by FA-Seed with 29 queried points; and (f) data space partitioned into 52 HXs by FA-Seed with 40 queried points. Numbered HXs indicate the HXs selected for expert query/label seeding at that stage, while unnumbered HXs are skipped. Beyond increasing the number of HXs, FA-Seed also evaluates features through density, centrality, and uncertainty measures, which guide both seed selection and HX construction, thereby supporting the conceptualization of the clustering model.
Figure 3. HX selection on the Aggregation dataset using FA-Seed. Balanced HXs (1–21) are retained for expert query/label seeding, whereas imbalanced HXs (a–e) are discarded at this stage. Numbered HXs indicate the selected HXs used for seeding, and letter-labeled HXs correspond to unselected ones. Some HXs share common boundary lines because their minimum and maximum vertices coincide in the feature space, reflecting adjacent partitions rather than overlapping regions.
Figure 5. Improved neural network architecture of FA-Seed. The selected features are normalized and provided to the input layer F A . The input attributes are then grouped into fuzzy HXs, which serve as the structural units for clustering. In layer F B , candidate seeds are scored, followed by seed evaluation and label propagation in F H . Finally, layer F C produces the final cluster assignments.
Figure 6. Stability analysis of FA-Seed across benchmark datasets.
Figure 7. Comparison of NoS values between FA-Seed and baseline algorithms on benchmark datasets, including the following: (a) Ecoli dataset, (b) Iris dataset, (c) Soybean dataset, (d) Thyroid dataset, (e) Yeast dataset, (f) Zoo dataset.
Figure 8. Comparison of RI values between FA-Seed and baseline algorithms on benchmark datasets, including the following: (a) Ecoli dataset, (b) Iris dataset, (c) Soybean dataset, (d) Thyroid dataset, (e) Yeast dataset, (f) Zoo dataset.
Figure 9. Comparison of accuracy measures between FA-Seed and MSCFMN.
Figure 10. Average clustering performance RI achieved by the proposed FA-Seed and FC2, including the following: (a) Iris dataset, (b) Wine dataset.
Figure 11. Comparison of runtime (in seconds) between FA-Seed and baseline algorithms on benchmark datasets, including the following: (a) Soybean, Zoo, and Thyroid datasets, (b) Iris and Ecoli datasets.
Table 1. Benchmark datasets used in the experiments.
Dataset	#Instances	#Features	#Clusters
Soybean	47	35	4
Zoo	101	16	7
Iris	150	4	3
Wine	178	13	3
Thyroid	215	5	3
Flame	240	2	2
Spiral	312	2	3
Pathbased	317	2	3
Ecoli	336	8	8
Jain	373	2	2
PID	768	8	2
Aggregation	788	2	7
Yeast	1484	8	10
ThyroidDisease	7200	21	3
