Article

Semantic Non-Negative Matrix Factorization for Term Extraction

1 Big Data and Blockchain Technologies Research Innovation Center, Astana IT University, Astana 010000, Kazakhstan
2 Laboratory of Digital Technologies and Modeling, Sarsen Amanzholov East Kazakhstan University, Ust-Kamenogorsk 070000, Kazakhstan
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2024, 8(7), 72; https://doi.org/10.3390/bdcc8070072
Submission received: 16 May 2024 / Revised: 19 June 2024 / Accepted: 24 June 2024 / Published: 27 June 2024

Abstract

This study introduces an unsupervised term extraction approach that combines non-negative matrix factorization (NMF) with word embeddings. Inspired by a pioneering semantic NMF method that employs regularization to jointly optimize document–word and word–word matrix factorizations for document clustering, we adapt this strategy for term extraction. Typically, a word–word matrix representing semantic relationships between words is constructed using cosine similarities between word embeddings. However, it has been established that transformer encoder embeddings tend to reside within a narrow cone, leading to consistently high cosine similarities between words. To address this issue, we replace the conventional word–word matrix with a word–seed submatrix, restricting columns to ‘domain seeds’—specific words that encapsulate the essential semantic features of the domain. Therefore, we propose a modified NMF framework that jointly factorizes the document–word and word–seed matrices, producing more precise encoding vectors for words, which we utilize to extract high-relevancy topic-related terms. Our modification significantly improves term extraction effectiveness, marking the first implementation of semantically enhanced NMF, designed specifically for the task of term extraction. Comparative experiments demonstrate that our method outperforms both traditional NMF and advanced transformer-based methods such as KeyBERT and BERTopic. To support further research and application, we compile and manually annotate two new datasets, each containing 1000 sentences, from the ‘Geography and History’ and ‘National Heroes’ domains. These datasets are useful for both term extraction and document classification tasks. All related code and datasets are freely available.

1. Introduction

Computational terminology encompasses a variety of methods aimed at the automatic extraction of stable lexical units from a given communicative context [1]. Depending on the form in which the context is presented, either as a corpus of domain text or a single document, two popular tasks are distinguished in computational terminology: term extraction and keyword extraction. Term extraction involves identifying words that consistently appear across a corpus of documents and which are specific to a domain, reflecting the specialized language and concepts of that domain. Keyword extraction focuses on identifying words that best describe the content of a single document and highlight its key information. Therefore, the primary difference between these tasks is determined by the fact that keywords are attributes of the text, while terms are attributes of the domain [1]. This generates a difference in terms of understanding the relevance of extracted lexical units.
According to the authors [2], the relevance of terms, as opposed to keywords, should be measured within the conceptual structure of the domain, not within the corpus. This perspective emphasizes the primary significant challenge of automatic term extraction, described by the authors of [2] as “the issue of coherence of terms as a set”, also known as the “apple-orange-banana” issue. Interpreting this issue, if the term “apple” is recognized within a text corpus, semantically close terms like “orange”, “banana”, and even “durian” should also be extracted, regardless of how frequently these words appear in the corpus [2]. This issue clearly underscores the need to reach beyond the corpus for term extraction, requiring the utilization of external knowledge bases. Word embedding represents one of the compact and contemporary methods for capturing domain knowledge. Hence, our primary goal lies in leveraging word embeddings to expand the semantic scope of the corpus and provide the most relevant depiction of the domain.
On the flip side of the coin, the issue of “coherence of terms as a set” is mirrored by the issue of “generalization across domains”. Both issues are two sides of the same problem, relating to how domain knowledge should be organized, which is a fundamental aspect of cognition, learning, and artificial intelligence. The concept of “coherence of terms as a set” focuses on the internal consistency and logical arrangement of terms/concepts within a domain. This ensures that the terms/concepts grouped together under a category share common defining characteristics that make the group meaningful. Conversely, “generalization across domains” concerns the application of these coherent groups of terms/concepts to new and varied domains. This involves abstracting knowledge from one context and adapting it to others, testing the boundaries of how far the underlying principles of a category can be applied. Together, these sides address both the depth and breadth of knowledge organization.
Generalization across domains is the vulnerable point of most modern knowledge discovery frameworks based on transformer models [3]. Transformers, which are state of the art in deep learning, achieve top performances in many natural language processing (NLP) tasks, including both language understanding and language generation (e.g., BERT, GPT-4, LLaMA etc.). In the context of term extraction, NLP techniques are employed to identify and extract relevant lexical units from text data, thereby facilitating better knowledge discovery and engineering. However, the performance of transformers dramatically drops when they are applied to a new communicative context—whether a new domain or language. In such cases, they must re-learn or undergo fine-tuning using new data from the new domain or language. For high-tech domains, the primary targets of most commercial knowledge discovery platforms, annotating training data is expensive due to the need for highly specialized experts [3]. Despite the recent emergence of promising transfer learning studies, offering hope for generalizing supervised approaches [4,5,6,7,8], there is still room for improvement in supervised term extraction and broader knowledge discovery tasks.
Unlike their supervised counterparts, unsupervised term extraction methods do not require training data and are less affected by domain shifts. However, given this invariance to the domain, unsupervised methods usually exhibit lower performance [9]. This sets the stage for our efforts. While our primary motivation is to enhance the semantic understanding of the domain using word embeddings, incorporating semantics can also improve the performance of unsupervised methods, specifically, the non-negative matrix factorization (NMF) approach, the enhancement of which is our secondary, yet equally critical, motivation. The standard NMF framework for topic modeling focuses on factorizing a sparse document–word matrix, which represents the distribution of words across documents, into two matrices: document–topic and topic–word. The document–topic matrix is typically used to cluster documents by topics, while the topic–word matrix is employed to extract the most valuable words from these topics. The hypothesis is that incorporating semantics, encoded by word embeddings, can mitigate the effects of sparsity in the document–word matrix and enhance the effectiveness of both document clustering and term extraction.
The authors of [10] were the first to propose such an integration, suggesting the use of semantic NMF to enhance document clustering. Their method, both elegant and powerful, involves the joint factorization of the sparse document–word matrix and the semantically rich word–word matrix derived from word embeddings into two distinct products, sharing a common topic–word factor. This approach draws semantically related words closer in the latent topic space, enhancing the representation of document–topic relationships and thereby improving document clustering. Inspired by their novel, theoretically grounded algorithm—an extension of the widely recognized NMF by Lee and Seung [11]—we adapted it for term extraction.
To the best of our knowledge, while some studies have applied semantically enriched NMF for document clustering, we are the first to utilize it for term extraction. We attribute this to the fact that the solution to document clustering is typically sought in a lower-dimensional space compared to that of term extraction. Typically, the word–word matrix, representing semantic relationships between words, is constructed using cosine similarities between word embeddings. However, it has been established that embeddings from transformer encoders tend to reside within a narrow cone, leading to consistently high cosine similarities between words [12]. This phenomenon can limit the power of the word–word matrix in distinguishing between domain terms and those that are merely co-located within the same dense area. While this does not negatively impact the effectiveness of semantic NMF for document clustering, it is crucial for term extraction. To address this issue, we substitute the word–word matrix with a word–seed submatrix, limiting its columns to ‘domain seeds’—specific words that capture the core semantic characteristics of the domain.
We compared our approach with traditional NMF [9] and the recent, but already popular, transformer-based methods KeyBERT and BERTopic [13,14]. The results demonstrated the significant superiority of semantic NMF across two related datasets from the Geography & History and National Heroes domains, which were generated using GPT-4 and subsequently manually checked and annotated. Therefore, the main contributions of our study are as follows.
  • Novel semantic NMF method for term extraction: We introduced a novel unsupervised method for automatic term extraction that utilizes the semantic NMF algorithm.
  • Novel construction of the word–word matrix using seed words: We proposed a novel way to construct the word–word matrix, a cornerstone of semantic NMF, by introducing the concept of domain seeds. In fact, we replaced the word–word matrix with a word–seed submatrix.
  • Development of annotated datasets: We generated two datasets, each consisting of 1000 sentences from the Geography & History and National Heroes domains, and manually annotated them for both term extraction and document clustering/classification tasks. We made these datasets freely available.

2. Related Work

NMF is not commonly used for term extraction, although it generates two important matrices: the document–topic matrix, which shows how documents are distributed across topics, and the topic–word matrix, which shows how words are distributed across topics. When examining topics as deeper explorations of the domain, the words with the highest weights within each topic are usually key to understanding that domain. By focusing on these heavily weighted words, NMF can effectively highlight the essential terms of the domain. However, the most vulnerable aspect of this concept is data sparsity, which results in the underrepresentation of infrequently used terms that, despite their importance to the domain, receive less emphasis in topic representations. Semantic NMF enhances the traditional NMF approach by integrating semantic information, which improves text analysis capabilities and mitigates data sparsity. While numerous studies in this field exist, they predominantly focus on document clustering.
One pioneering study in this field introduces a semi-supervised NMF (SSNMF) that performs joint factorization of data and label matrices, using a shared factor matrix to ensure consistency [15]. A vector of class labels is converted into a binary matrix, where ‘1‘ appears in each row, but only at the position corresponding to a specific class, with all other items in the row set to ‘0‘. Additionally, binary weights are employed to manage cases where class labels are unknown or when data entries in the data matrix are missing. The experimental results demonstrate that semi-supervised NMF enhances classification performance by incorporating labeled data samples. A two-step semi-supervised semantic NMF approach is introduced in [16]. The first step involves extracting interpretable topics from labeled data, and the second step uses these topics to guide the refinement of final bases into semantically interpretable forms. This semantic NMF approach also outperforms conventional NMF.
The highly effective semantic NMF (SeNMFk) approach is detailed in [17,18]. SeNMFk conducts joint factorization of the document–word matrix using the TF-IDF metric and the word–context matrix, utilizing Kullback–Leibler divergence. Moreover, the determination of the number of latent factors, k, is based on custom resampling rather than an empirical approach. The authors of [19] expand SeNMFk with a focus on large datasets. Their novel distributed approach, named SeNMFk-SPLIT, enables the joint factorization of large documents by separately decomposing the word–context and term–document matrices. The previously mentioned work [10], which served as our source of inspiration for this study, also falls into the same category of research [17,18,19] that connects standard NMF with the semantics hidden in dense or distributional encodings of words.
An interesting empirical approach is combined with semantic NMF in [20]. To encourage certain ‘important’ words from the corpus vocabulary to play a larger role in factorization, the rows corresponding to these words in the word–document matrix are directly multiplied by a large constant coefficient. Combining the highlighting of keywords with semantic NMF allows for the formation of significant topics that align with desired prior keywords. In [21], keywords provided under user supervision, which are referred to as seed words, and groups of these words, which are known as seed topics, are utilized. Guided NMF (GNMF) is formulated as the joint regularization of the data matrix and a seed matrix, which is constructed from these seed topics. The authors of [22] extend the semi-supervised and guided semantic NMF approaches by introducing guided semi-supervised semantic NMF (GSSNMF), which incorporates both label information and seed words. Compared to SSNMF, GSSNMF more accurately classifies documents, and it generates more coherent topics relative to GNMF.
The word–word matrix, representing semantic relationships among words, can be directly factored using symmetric NMF techniques [23], without the need for joint factorization with the document–word matrix. Symmetric NMF factors a pairwise similarity matrix into the products of a non-negative matrix and its transposed matrix, thereby extending standard NMF from an inherently dimensional reduction technique to a general clustering method. In [24], a semi-supervised symmetric NMF model is proposed, capable of simultaneously learning the similarity matrix with supervisory information in the form of pairwise ML (must link) and CL (cannot link) constraints, and of generating higher-quality clusters. The idea of ML and CL constraints is also used in [25], albeit with different penalty strategies. ML constraints are employed to control the distance of the data in the basis matrix, while CL constraints regulate the encoding vector matrix. In both studies [24,25], symmetric NMF, enhanced with supervisory information in the form of ML and CL constraints, demonstrates a superior quality of clustering.
A multistep approach to semantic topic modeling is used in [26]. This study enhances traditional NMF by incorporating semantic text representation using word sense disambiguation and lexical semantic similarity measures. The model refines the term–topic matrix through semantic low-pass filtering and tag edge detection to improve topic coherence and stability. The final topic–document matrix is constructed by integrating the refined term–topic matrix with the document–term matrix.
As the review of related work indicates, semantic NMF captured significant interest from researchers for its potential to enhance topic expressiveness, cohesion, and document clustering performance. This primarily affects the right factor in the NMF matrix product, the document–topic matrix. However, there remains a research gap concerning how semantic enhancements influence the left factor, the topic–word matrix. This study addresses this gap by experimentally demonstrating the feasibility of these enhancements.

3. Methodology

Figure 1 illustrates the generic pipeline of the proposed methodology, which consists of four principal steps:
  • The creation of the document–word matrix. This matrix represents the distribution of words across the documents in the corpus. Each entry in the matrix reflects the frequency of a word’s occurrence in a specific document, providing a foundational data structure for further analysis.
  • The creation of the word–seed matrix. This matrix is formed by calculating pairwise cosine similarities between the embeddings of words and predefined seed words. Seed words are selected by a domain expert based on their relevance to the domain, helping to anchor the semantic space around key concepts.
  • The joint factorization of the created matrices. The document–word and word–seed matrices are jointly factorized to produce a semantically refined topic–word matrix. This step applies the semantic NMF algorithm and allows for a deeper, more meaningful extraction of topics that are semantically coherent.
  • The extraction of the most relevant terms. From each topic in the resulting topic–word matrix, the most relevant terms are extracted. These terms are identified based on their weighted contribution to each topic, highlighting the terms that best define and represent the underlying topics within the domain.
Below, we provide a brief background on NMF, describe the standard NMF algorithm for term extraction, and then move on to the semantic NMF algorithm, detailing how to construct the word–seed matrix.

3.1. Standard NMF for Term Extraction

Non-negative matrix factorization (NMF) aims to closely approximate the non-negative matrix A of dimension m × n with the product of two non-negative matrices W and H of dimensions m × k and k × n, respectively, such that A ≈ WH. Here, k is smaller than both m and n, indicating a reduction in dimensionality. This allows us to capture the core features of the data represented by matrix A. In this factorization, the matrix W is referred to as the basis matrix, while the matrix H is known as the coefficient matrix or vector-encoding matrix. The approximation A ≈ WH is achieved by finding matrices W and H that minimize the cost function L(A, WH). The cost function (also referred to as the error function or objective function) quantifies the discrepancy between the original matrix A and the product WH using either the Frobenius norm or the Kullback–Leibler divergence. Below, NMF is formulated as a non-convex optimization problem, employing the Frobenius norm ‖·‖_F as the cost function:
$$\min_{H,W} L(A, WH) = \min_{H,W} \| A - WH \|_F^2 = \min_{H,W} \sum_{i,j} \left( A_{ij} - (WH)_{ij} \right)^2 \qquad (1)$$
A non-convex solution space for NMF implies the existence of multiple local minima, none of which are guaranteed to be a global minimum. The renowned algorithm developed by Lee and Seung [11] solves Equation (1) through iterative multiplicative transformations of matrices W and H, starting from random initial matrices. The algorithm ensures a monotonic, step-by-step decrease in the cost function; however, like other NMF algorithms, it does not necessarily lead to the best possible decomposition, i.e., the global minimum. Below are the multiplicative rules for each iteration step of the algorithm, where the symbol ⊙ denotes element-wise multiplication, also known as the Hadamard product:
$$H \leftarrow H \odot \frac{W^{T} A}{W^{T} W H}, \qquad W \leftarrow W \odot \frac{A H^{T}}{W H H^{T}} \qquad (2)$$
The algorithm stops based on one of the following conditions: (1) a convergence criterion, which is met when the decrease in the cost function between successive iterations falls below a predetermined threshold; (2) a maximum number of iterations, which prevents the algorithm from running indefinitely, especially if convergence is slow or the algorithm oscillates; (3) the stability of factors, which occurs if the matrices W and H stabilize. The choice of initial values for matrices W and H can significantly affect the results of NMF, including the speed of convergence and the quality of the local solution obtained [27,28]. Various initialization strategies are used, such as random assignments, data-based initialization, or special heuristic methods [27].
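For illustration, the multiplicative-update scheme above can be written in a few lines of NumPy. The following is a minimal sketch rather than the exact implementation used in this study; the function name, tolerance, iteration cap, and random initialization are illustrative choices.

```python
import numpy as np

def nmf_multiplicative(A, k, max_iter=500, tol=1e-5, eps=1e-10, seed=0):
    """Approximate a non-negative matrix A (m x n) as W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))              # random non-negative initialization
    H = rng.random((k, n))
    prev_err = np.inf
    for _ in range(max_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # H <- H (W^T A) / (W^T W H), element-wise
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # W <- W (A H^T) / (W H H^T), element-wise
        err = np.linalg.norm(A - W @ H, "fro") ** 2
        if prev_err - err < tol:               # convergence criterion on the cost decrease
            break
        prev_err = err
    return W, H
```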
These features render NMF exceptionally useful in NLP tasks such as topic modeling, document clustering, and term extraction, where finding an acceptable local solution often suffices. In these applications, the ability to reveal the underlying structure of the data is often more critical than achieving absolute solution optimality. This focus on structure over optimality aligns with key principles of machine learning, where understanding the underlying patterns can lead to more insightful outcomes. This is often discussed in the context of overfitting, which represents a model that is “too optimal” for the training data but fails to generalize and effectively capture the broader structure [29]. By emphasizing the discovery of data structure, NMF supports their interpretation, which is particularly valuable in domains like text analytics.
Let us formulate the NMF task for topic modeling, along with its subsequent applications in document clustering and term extraction. Given a corpus consisting of m documents and a vocabulary associated with this corpus comprising n words, let A be a document–word matrix of dimension m × n. The element a_ij of matrix A represents the frequency of occurrences of the j-th word in the vocabulary within the i-th document of the corpus (i = 1, …, m; j = 1, …, n). The task requires the approximate decomposition of the document–word matrix A (m × n) into the product of the document–topic matrix W (m × k) and the topic–word matrix H (k × n), such that A ≈ WH. Here, k indicates the number of core topics in the corpus represented by matrix A, as shown in Figure 2. For the task of document clustering, the basis matrix W becomes particularly intriguing due to its role in grouping documents based on similar topics. Conversely, for the task of term extraction, the matrix of encoding vectors, H, is of significant interest because it captures the structural organization of topics through words.
The approach to term extraction using NMF is straightforward: extract the top T words with the highest coefficients from each of the k topics in the transposed matrix H^T, resulting in kT candidate terms, as shown in Figure 3. However, since a word can appear at the top of several topics (having high coefficients in multiple topics), the final number of unique candidates may be significantly less than kT after duplicates are removed. As noted in [30], a close examination of the encoding vectors produced by NMF reveals a high degree of overlap, indicating that some words can be representative of multiple topics.
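As an illustration of this extraction step, the following sketch assumes a fitted topic–word matrix H (k × n) and a vocabulary list aligned with its columns; the variable and function names are ours, introduced only for this example.

```python
import numpy as np

def top_terms_per_topic(H, vocabulary, T=5):
    """Return the T highest-weighted words of each topic and their deduplicated union."""
    per_topic = []
    for topic_weights in H:                                   # one row of H per topic
        top_idx = np.argsort(topic_weights)[::-1][:T]         # indices of the T largest weights
        per_topic.append([vocabulary[i] for i in top_idx])
    candidates = {word for topic in per_topic for word in topic}   # duplicates removed
    return per_topic, candidates
```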
Below, we demonstrate the application of NMF to extract terms from a corpus of 14 documents concerning the topics of China and Japan. The selection of topics is intentional, drawing from an inspiring example in a classic textbook on information retrieval (refer to [31], ch. 13). The original example explores a nuanced case of binary classification involving five documents, as shown in Table 1. Our enhancement of this example aims to offer a clearer understanding of NMF’s capabilities, assessing standard versus semantic capacities. Consequently, our example is continuous, and we will continue it in Section 3.2 and Section 3.3.
A fragment of our corpus is depicted in Table 2. To simplify our analysis, we construct a document–word matrix from this corpus, using only named entities such as ‘GPE’ (geopolitical entities), ‘LOC’ (locations), and ‘NORP’ (nationalities, religious and political groups) as features, instead of the entire corpus vocabulary. Despite the feature reduction, the resulting matrix has a high level of sparsity, a typical attribute observed in NLP applications (see Figure 4). ‘Japan’, ‘Chinese’ and ‘China’ are the most frequently occurring words in the corpus, appearing 9, 6 and 5 times, respectively. For comparison, rare words such as ‘Hokkaido’ or ‘Guangzhou’, which appear only once in the corpus, account for more than half of the total feature set. Applying NMF to the document–word matrix, we then concentrate on the topic–word matrix H and extract the top 5 words from each of the two topics. As we can see from Figure 5, the top 5 words in both topics include ‘Yokohama’ and ‘China’. This confirms, on the one hand, that NMF allows for overlap in the encoding vectors and, on the other hand, indicates insufficient separation of topics in the original corpus, where the cross-topic words ‘Yokohama’ and ‘Chinese’ appear in the same document. Besides the pair of ‘Yokohama’ and ‘Chinese’, the words ‘Nanjing’ and ‘Japan’ also co-occur in the same document, resulting in ‘Nanjing’ being placed into a topic that is not its own.
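For readers who wish to reproduce a similar entity-restricted document–word matrix, the sketch below uses spaCy’s ‘en_core_web_sm’ model and scikit-learn’s CountVectorizer on two toy sentences; the actual preprocessing pipeline used for our corpus may differ in its details.

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")
docs = ["Yokohama has one of the largest Chinese communities in Japan.",
        "Nanjing was an ancient capital of China."]          # toy stand-ins for the corpus

# Collect GPE / LOC / NORP entity tokens to serve as the feature set.
entity_vocab = set()
for doc in nlp.pipe(docs):
    for ent in doc.ents:
        if ent.label_ in {"GPE", "LOC", "NORP"}:
            entity_vocab.update(ent.text.split())

vectorizer = CountVectorizer(vocabulary=sorted(entity_vocab), lowercase=False)
A = vectorizer.fit_transform(docs).toarray()                 # document-word frequency matrix
print(vectorizer.get_feature_names_out())
print(A)
```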

3.2. Semantic NMF for Term Extraction

Word embeddings are dense vector representations of words learned by neural networks [32]. These neural embeddings are trained to capture linguistic contexts, so that similar words have vectors that are closely positioned within the dense vector space, efficiently encoding both semantic and syntactic relationships among words. In the semantic formulation of NMF, besides the original document–word matrix A, a word–word matrix M of dimension n × n is introduced [10]. The element m_ij of matrix M represents the cosine similarity between the i-th and j-th word embeddings in the dense vector space (i = 1, …, n; j = 1, …, n). Without loss of generality, matrix M can also be defined as a word–feature matrix of dimension n × n′ [10], so the i-th row of M represents the embedding of the i-th word in a special feature space.
A joint decomposition of the document–word matrix A and the word–feature matrix M into two separate products, A ≈ WH and M ≈ H^T Q, is required, sharing a common factor H (see Figure 6). Thanks to the joint factorization, the topic–word matrix H is expected to place semantically related words closer to each other in the compressed topic space of dimension k [10]. Thus, semantic NMF employs a composite cost function L, which, as in standard NMF, can be defined using the Frobenius norm:
$$\min_{H,W,Q} L(A, W, H, M, Q) = \min_{H,W,Q} \left[ L(A, WH) + \lambda \, L(M, H^{T} Q) \right] = \min_{H,W,Q} \left[ \| A - WH \|_F^2 + \lambda \, \| M - H^{T} Q \|_F^2 \right] \qquad (3)$$
where the parameter λ is used for regularization. When it is greater than 1, the matrix M has a more significant impact on the NMF result. Conversely, when λ is less than 1, the matrix A has a more significant impact. When λ is 0, semantic NMF degenerates into the standard NMF.
The semantic NMF algorithm [10] operates in a manner similar to the standard Lee and Seung NMF algorithm [11]. At the initial step, matrices W, H, and Q of dimensions m × k, k × n, and k × n′, respectively, are randomly generated. Subsequently, iterative updates of matrices W, H, and Q are performed until one of the stopping criteria listed above is met:
$$W \leftarrow W \odot \frac{A H^{T}}{W H H^{T}}, \qquad H \leftarrow H \odot \frac{W^{T} A + \lambda \, Q M^{T}}{W^{T} W H + \lambda \, Q Q^{T} H}, \qquad Q \leftarrow Q \odot \frac{H M}{H H^{T} Q} \qquad (4)$$
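A minimal NumPy sketch of these joint updates is given below; the dimensions follow the text (A: m × n, M: n × n′, W: m × k, H: k × n, Q: k × n′), while the function name, defaults, and stopping rule (a fixed number of iterations) are illustrative simplifications rather than the exact implementation used in this study.

```python
import numpy as np

def semantic_nmf(A, M, k, lam=3.0, max_iter=500, eps=1e-10, seed=0):
    """Jointly factorize A ~ WH and M ~ H^T Q with a shared factor H."""
    m, n = A.shape
    n_feat = M.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.random((m, k))
    H = rng.random((k, n))
    Q = rng.random((k, n_feat))
    for _ in range(max_iter):
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ A + lam * (Q @ M.T)) / (W.T @ W @ H + lam * (Q @ Q.T @ H) + eps)
        Q *= (H @ M) / (H @ H.T @ Q + eps)
    return W, H, Q
```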
Below, we extend our example, initiated in Section 3.1. To construct the matrix M, embedding vectors are obtained for each word considered in the document–word matrix using the FastText ‘wiki-news-300d-1M-subword’ embedding model (version 0.9.3) [33]. Subsequently, we calculate the cosine similarities between these embedding vectors. Figure 7 displays the resultant matrix M. Applying semantic NMF to the two matrices A and M, we then focus on the topic–word matrix H and extract the top 5 words from each of the two topics. As illustrated in Figure 8, the application of semantic NMF with λ = 3 results in clearly distinct topics, showcasing the model’s effectiveness in achieving thematic separation. This enhanced clarity is a direct consequence of incorporating the semantic relationships captured in matrix M.
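The construction of M can be sketched as follows, assuming the pre-trained ‘wiki-news-300d-1M-subword.vec’ file has been downloaded and that every vocabulary word is present in the model; gensim is used here purely for illustration.

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

vocab = ["Japan", "Yokohama", "Hokkaido", "China", "Nanjing", "Chinese"]   # toy vocabulary
kv = KeyedVectors.load_word2vec_format("wiki-news-300d-1M-subword.vec")
E = np.vstack([kv[w] for w in vocab])      # one 300-dimensional embedding per vocabulary word
M = cosine_similarity(E)                   # n x n matrix of pairwise cosine similarities
```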

3.3. Constructing the Word–Seed Matrix for Semantic NMF

The symmetric word–word matrix M of dimension n × n, which integrates relationships among words into the semantic NMF algorithm, is a particular case within a broader class of matrices of dimension n × n′ that offer word representations in various semantic feature spaces. For instance, the word–word matrix M of dimension n × n can easily be replaced by a word-embedding matrix of dimension n × 300, where 300 is the common length of embedding vectors in the well-known word2vec model [32]. In this study, we expand on the idea of mapping words into a compressed space of semantic features, which we refer to as domain seeds. Therefore, we replace the original word–word matrix, which represents pairwise cosine similarities between word embeddings, with a reduced submatrix that keeps all rows but only includes the columns corresponding to seed words (see Figure 9).
Below, we continue our example from Section 3.1 and Section 3.2, using three specific seed words, namely ‘Japan’, ‘Chinese’, and ‘China’, to reduce our word–word matrix M (see Figure 10a). We then apply semantic NMF using the reduced matrix. Despite the reduction, the semantic NMF algorithm still effectively distinguishes between topics, as evidenced by the clear topic separation shown in Figure 10b,c.
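Continuing the previous sketch (and reusing its vocab, E, and M variables), the word–seed submatrix can be obtained either by computing similarities directly against the seed embeddings or by selecting the seed columns of the full word–word matrix:

```python
from sklearn.metrics.pairwise import cosine_similarity

seeds = ["Japan", "Chinese", "China"]                  # domain seeds from the example
seed_idx = [vocab.index(s) for s in seeds]

M_seed = cosine_similarity(E, E[seed_idx])             # n x s word-seed matrix
M_seed_alt = M[:, seed_idx]                            # equivalent: keep only the seed columns
```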

4. Experiments and Results

4.1. Datasets

To evaluate semantic NMF for term extraction, two datasets, entitled ‘Geography & History’ (GH) and ‘National Heroes’ (NH), were created using ChatGPT-4, with no more than 5% of the sentences derived from Wikipedia. Each dataset comprises 1000 sentences, manually annotated and evenly classified into 10 categories corresponding to the following countries: China, Egypt, Greece, Iran, Japan, Kazakhstan, Mongolia, Russia, Turkey, and Uzbekistan. Each sentence in the datasets contains one or more terms related to the corresponding country. In the GH dataset, terms include geographical names, referred to as geonyms. In the NH dataset, the terms include the names of famous people, deities, and rulers, referred to as anthroponyms or theonyms.
The process of creating each dataset involves three steps: (1) the selection of terms; (2) the generation and annotation of sentences; and (3) the manual verification of terms and sentences. During the first step, two experts manually select terms for each of the 10 countries included in the datasets. The terms should meet three criteria: (1) they should be culturally connected and commonly associated with the country, and preferably rarely mentioned outside of the original geographical and cultural context; (2) they should be recognizable at the international level; and (3) they should be present in the trained FastText embedding model.
At the second step, experts generate sentences containing the selected terms with the help of ChatGPT-4. For example, for the GH dataset, the requests can be structured as follows: ‘Generate a sentence where the geonym XXX associated with the country YYY is mentioned’, as illustrated in Figure 11. For the NH dataset, the requests can be structured as follows: ‘Generate a sentence where the anthroponym/theonym XXX is mentioned, such that the sentence also includes the geonym YYY associated with the country ZZZ’, as shown in Figure 12. At the third step, the outputs of ChatGPT-4 are manually checked for reliability and consistency, with any sentences that exhibit anomalies or hallucinations being corrected or refined. Table 3 provides random examples of sentences from both datasets for each of the 10 countries.
Detailed dataset statistics are presented in Table 4. As indicated in Table 4, although all countries are represented by the same number of sentences (100 each), Uzbekistan accounts for the highest number of unique geonyms in the GH dataset, while Iran has the highest number of unique anthroponyms and theonyms in the NH dataset.
If we sum up the number of unique terms in Table 4, we obtain 653 terms for the GH dataset and 476 terms for the NH dataset. For the NH dataset, this number matches the real number of unique terms, since classes do not share common terms. For the GH dataset, classes share common terms; for example, the term ‘Altai’ occurs in sentences of the Russian, Mongolian, and Kazakhstani classes. So, 653 is the number of terms with duplicates, whereas the number of unique terms in the GH dataset is 461. Regarding words, the most frequently used word in the GH dataset is ‘city,’ which appears 136 times, while the most frequently used word in the NH dataset is ‘born,’ which appears 159 times. Table 5 displays the top 10 words for both datasets. Regarding terms, the most frequent geonyms in the GH dataset are ‘Central’ and ‘Kazakhstan,’ while the most frequent anthroponyms in the NH dataset are ‘Chingis’ and ‘Khan.’ Figure 13 illustrates term clouds for both datasets.

4.2. Experimental Settings

This subsection outlines the experimental settings used to assess the efficiency of semantic NMF for term extraction. The experiments aim to evaluate the precision, recall, and F1 scores of semantic NMF when applied to the GH and NH datasets. The experimental setup involves defining four key parameters of the semantic NMF approach:
  • Number of seeds.
  • Number of topics (k).
  • Regularization parameter (λ).
  • Number of top words extracted from each topic (T).

4.2.1. Selection of the Number of Seeds

As described earlier, the word–seed matrix represents the cosine similarities between corresponding embeddings of the vocabulary words and seed words. Embeddings are generated using the FastText pre-trained model. Two different sets of seed words are selected to evaluate the impact of seeding on the performance of semantic NMF: (1) a set of 20 seed words, with 2 seed words per class; (2) a set of 30 seed words, with 3 seed words per class (see Table 6).
The seed words used in this study are manually chosen based on expert prior knowledge of the domain. This manual selection utilizes domain-specific expertise to ensure that the chosen seeds are highly relevant to the targeted topics. Alternatively, techniques such as TF-DCF [34] can be employed to automatically extract seed words from the text corpus. This method identifies words that are significant within the domain corpus relative to the other corpora. The ablation study on the number of seeds is provided in Table A4 in Appendix A.
For the GH dataset, the size of the corpus vocabulary is 2923 words, so the dimensions of the document–word matrix are 1000 × 2923, and the dimensions of the word–seed matrices are 2923 × 20 and 2923 × 30 for 20 and 30 seed words, respectively. For the NH dataset, the size of the corpus vocabulary is 3695 words, so the dimensions of the document–word matrix are 1000 × 3695, and the dimensions of the word–seed matrices are 3695 × 20 and 3695 × 30, respectively.

4.2.2. Selection of the Number of Topics k

The selection of the number of topics (k) for semantic NMF poses a significant challenge. Although methods such as the elbow method or the evaluation of model coherence provide some guidance, no universally accepted approach ensures the optimal selection of k across diverse contexts. The choice of k is profoundly influenced by the complexity of the underlying data and the specific objectives of the analysis. In this study, we set the number of topics to 10, corresponding to the number of predefined classes (countries) in both datasets. Results of the initial experiments with k varying from 2 to 20, as presented in Table A1 in Appendix A, partially confirm our intuitive settings.
As shown in Figure 14, the optimal number of topics k for the GH dataset corresponds to the predefined 10 classes. This alignment can be explained by the clear division of geo-historical terms based on territorial principles. However, for the NH dataset, the optimal number of topics is not only 10 but also 4. We can explain this by the fact that national heroes of the past and present cannot be strictly divided by territorial principles. For instance, names of heroes from countries of the former socialist bloc (Russia, Kazakhstan, Mongolia, and Uzbekistan) ended up in the same topic, while another topic included names of prominent political leaders (Stalin, Atatürk, etc.). Clearly, the embeddings pre-trained on diverse semantic relationships significantly influenced this division.
Adding to this, the grouping of such varied historical figures under broad topical umbrellas underscores the potential and challenges of semantic NMF. It illustrates how deeply embedded cultural and historical connections can transcend contemporary geopolitical boundaries, revealing complex interrelations that traditional clustering algorithms might miss. This aspect of semantic embeddings showcases their ability to provide nuanced insights into data. However, this also poses a challenge as it requires oversight of the semantic structures and complicates the interpretation of the results.

4.2.3. Selection of the Regularization Parameter λ

The regularization parameter λ is tested from 1 to 35 in an ablation study, the results of which are shown in Table A2 in Appendix A. The results indicate that higher values of λ improve the F1-score, with no further increase beyond 30, as shown in Figure 15. Therefore, we fix λ at 30.

4.2.4. Selection of the Number of Top Words T

The number of top words T extracted from each topic is set to 70 for both the GH and NH datasets, resulting in approximately 700 words from 10 topics. After removing duplicates, this yields about 450–550 unique words, approximately matching the number of true terms. The ablation study on the number of extracted top words T is provided in Table A3 in Appendix A. Semantic NMF is also evaluated using anti-seeds: frequently occurring words in the dataset that are used to identify and eliminate topics dominated by generic words. Anti-seeds are selected from the most frequent words. Specifically, we identify the most common words in each dataset, as these words tend to be generic and less informative for distinguishing the specific topics of interest. We begin by analyzing the word frequency in the entire dataset, creating a frequency distribution of all words, and from this distribution we select the top words as anti-seeds. For our study, we chose the top 10 most frequent words, which were manually inspected and confirmed to be generic for the topics of interest in each dataset; these words can be found in Table 5. Consequently, four experimental setups are defined, as shown in Table 7.
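The anti-seed selection can be sketched as follows; the tokenization, the cut-off of 10 anti-seeds, and the topic-flagging threshold are illustrative (the last is a hypothetical parameter introduced only for this example).

```python
from collections import Counter
import re

def select_anti_seeds(sentences, top_n=10):
    """Pick the top_n most frequent corpus words as generic anti-seeds."""
    counts = Counter(w.lower() for s in sentences for w in re.findall(r"[A-Za-z]+", s))
    return {w for w, _ in counts.most_common(top_n)}

def flag_generic_topics(topic_top_words, anti_seeds, max_hits=2):
    """Flag topics whose top words overlap too strongly with the anti-seeds."""
    return [i for i, words in enumerate(topic_top_words)
            if sum(w.lower() in anti_seeds for w in words) > max_hits]
```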

4.3. Baselines

To compare the effectiveness of the semantic NMF approach to term extraction, we evaluate three established baselines: standard NMF, BERTopic, and KeyBERT. For semantic NMF, standard NMF, and BERTopic, we use the same number of topics k and the same number of extracted top words T. BERTopic is a topic-modeling approach that employs BERT embeddings along with c-TF-IDF to generate dense clusters. By default, BERTopic uses HDBSCAN to perform its clustering; however, we replaced it with k-means, allowing us to specify the number of topics to extract. By standardizing these critical parameters across the three methods, the evaluation focuses solely on the effectiveness of each method within the same structural framework. This setup ensures that differences in performance are attributable to the intrinsic capabilities of the methods themselves rather than external settings, making the comparison as “apple-to-apple” as possible.
Unlike the other two baseline methods, KeyBERT does not align perfectly in terms of parameter settings for a direct comparison. Specifically, for KeyBERT, the number of terms extracted is set substantially higher than for the other methods, which inherently increases its recall. This adjustment is necessary because KeyBERT is designed to extract keywords rather than terms, necessitating the specification of the number of words to be extracted from each sentence. As a result, setting the minimum number of words to be extracted to 2 elevates the total number of lexical units to 2000, significantly more than its counterparts. However, despite these methodological divergences, KeyBERT remains a crucial part of our analysis due to its high effectiveness and reliance on BERT-based embeddings. Like semantic NMF, KeyBERT supports seed words to softly steer the extraction toward desired terms. However, initial experiments show that seeding is unnecessary when extracting only two words per sentence with KeyBERT. Table 8 shows the results of the baselines.
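The baseline configurations can be reproduced roughly as sketched below; docs stands for the 1000 dataset sentences and must be supplied, and the embedding models and remaining hyper-parameters may differ from those used in our experiments.

```python
from sklearn.cluster import KMeans
from bertopic import BERTopic
from keybert import KeyBERT

docs: list[str] = []        # placeholder: fill with the 1000 dataset sentences
n_topics, top_words = 10, 70

# BERTopic with k-means instead of HDBSCAN, so the number of topics can be fixed.
topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=n_topics), top_n_words=top_words)
topics, _ = topic_model.fit_transform(docs)
bertopic_terms = [w for t in set(topics) for w, _ in topic_model.get_topic(t)]

# KeyBERT extracting the top 2 single-word keywords from every sentence.
kw_model = KeyBERT()
keybert_terms = [w for doc in docs
                 for w, _ in kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), top_n=2)]
```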
NMF, BERTopic, and KeyBERT allow for term repetition when extracting from different topics or documents (unlike, e.g., the TF-IDF method, which penalizes repetitions). Therefore, the column “Number of extracted terms” in Table 8 contains two numbers separated by a slash: the first indicates the total number of extracted terms, and the second shows the number of extracted terms after duplicates are removed.
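For reference, the precision, recall, and F1 scores reported in the next section can be computed along the following lines, treating the manually annotated terms as the gold set and the deduplicated extracted words as predictions; the case-normalization shown here is an assumption made for this sketch.

```python
def term_extraction_scores(extracted, gold):
    """Precision, recall, and F1 of an extracted term set against a gold term set."""
    extracted = {t.lower() for t in extracted}
    gold = {t.lower() for t in gold}
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```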

5. Results and Discussion

Table 9 provides the results for each of the described experimental setups for both datasets. For the GH dataset, the results show an improvement in precision, recall, and F1-score as the number of seeds increases, as indicated by Setup 1 and Setup 2. Furthermore, the use of anti-seeds (Setups 3 and 4) notably increases precision while slightly lowering recall compared to the setups without anti-seeds. The NH dataset results reveal a more moderate overall performance, but there is an increase in F1-score for Setup 4 compared to Setup 2, indicating the impact of using anti-seeds. The setups show better performance across both datasets when compared to standard NMF.
The F-score of semantic NMF for the GH dataset is significantly superior to the F-scores of both baselines, whereas for the NH dataset this superiority is maintained only when compared to standard NMF and BERTopic. A closer analysis of the results of semantic NMF and KeyBERT reveals that the precision of semantic NMF is higher, but its recall is much lower, which accounts for KeyBERT’s superior performance. This is primarily because semantic NMF is at a disadvantage: it extracts only 700 words, 70 for each of the 10 topics, whereas KeyBERT extracts 2000 words, 2 for each sentence. The broader the set of extracted words, the greater the likelihood of a term appearing in it, and thereby the greater the recall. A sensitive point of our approach is that the optimal number of words to extract for a balance between precision and recall is unknown. Expanding the word set excessively leads to an inevitable increase in recall and a corresponding decrease in precision.
Consequently, additional experiments were conducted for semantic NMF, with the number of extracted words ranging from 1000 to 2000 to ensure parity with KeyBERT. Semantic NMF achieved its best F-score of 58.9 with 1400 words, still outperforming KeyBERT, which achieved its best score of 57.6. Similarly, for the sake of consistency, we conducted additional experiments for standard NMF and BERTopic on the NH dataset, with the number of extracted words ranging from 1000 to 2000. The best results for all three considered methods are provided in Table 10. The graphical representation of the best F-scores is given in Figure 16.
The differing results between datasets require further investigation, a question we leave for future research. The most apparent explanation is the higher cohesion of the first dataset, which yields a more compact term space.

6. Conclusions

This research explores the effectiveness of the semantic NMF approach for term extraction within two datasets. Our experiments indicate that semantic NMF significantly outperforms both traditional NMF and KeyBERT, a model that employs deep learning-based embeddings to identify keywords. The superior performance of semantic NMF can be attributed to its enhanced ability to capture and utilize the semantic relationships among words. Comparatively, standard NMF, though effective in identifying prevalent topics by decomposing the document–word matrix, often fails to capture the semantic relationships between words due to its reliance on surface-level frequency-based context analysis. KeyBERT, while innovative in its use of BERT embeddings for a deeper understanding of textual context, still cannot consistently grasp the broader topic structures that semantic NMF can since it works at the sentence level rather than the domain level.
However, a significant limitation of semantic NMF is its dependence on high-quality embeddings. When the number of candidate words is large, the resulting word–word similarity matrix based on embeddings tends to overestimate relationships that are not relevant to the domain. This excess of irrelevant high-scoring relationships introduces noise into the factorization process and reduces the accuracy of term extraction. This limitation became especially apparent when using semantic NMF to analyze bigrams or trigrams, as this significantly increases the number of relationships that are highly scored by embeddings, thereby introducing noise into the space.
In the related work section, we discuss approaches that can manipulate the similarity matrix by introducing ML (must link) and CL (cannot link) supervisory information, and this will be our focus for future research. By refining these methods, we aim to reduce the noise and enhance the accuracy of term extraction, particularly for multi-word terms. Addressing these challenges will be crucial for improving the robustness and applicability of semantic NMF in various domains.

Author Contributions

Conceptualization, A.N.; methodology, A.N.; software, A.A., K.R. and A.N.; validation, A.N. and A.A.; formal analysis, Y.B.; data curation, A.M. and A.N.; writing, A.N. (Section 1, Section 2, Section 3, Section 5 and Section 6), A.M. (Section 4.1), A.A. (Section 4.2 and Section 4.3); visualization, A.N. and K.R.; supervision, A.N.; project administration, Y.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Higher Education of the Republic of Kazakhstan, grant number AP19677756.

Data Availability Statement

The dataset is available in the Hugging Face repository at https://doi.org/10.57967/hf/2627 (accessed on 23 June 2024) and the code is available in the GitHub repository https://github.com/Almas-Alz/semanticNMF-term extraction (accessed on 23 June 2024).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Abbreviation | Definition
NMF | non-negative matrix factorization
BERT | bidirectional encoder representations from transformers
KeyBERT | keyword extraction technique that leverages bidirectional encoder representations from transformers
BERTopic | bidirectional encoder representations from transformers for topic modeling
NLP | natural language processing
SSNMF | semi-supervised non-negative matrix factorization
SeNMFk | semantic non-negative matrix factorization/semantic-assisted non-negative matrix factorization
TF-IDF | term frequency–inverse document frequency
SeNMFk-SPLIT | semantic non-negative matrix factorization—SPLIT
GNMF | guided non-negative matrix factorization
GSSNMF | guided semi-supervised non-negative matrix factorization
ML | must link
CL | cannot link
GH | Geography & History
NH | National Heroes

Appendix A. Ablation Study of Number of Topics k, Regularization Parameter λ, Number of Extracted Top Words T and Number of Seeds

Table A1. Number of topics k ranging from 2 to 20.
Dataset | Number of Topics k | Regularization λ | Number of Seeds | Number of Extracted Words | Precision | Recall | F1-Score
GH | 2 | 30 | 30 | 1000/764 | 0.573 | 0.952 | 0.716
GH | 3 | 30 | 30 | 999/707 | 0.617 | 0.948 | 0.747
GH | 4 | 30 | 30 | 1000/651 | 0.654 | 0.926 | 0.767
GH | 5 | 30 | 30 | 1000/661 | 0.64 | 0.92 | 0.755
GH | 6 | 30 | 30 | 996/684 | 0.611 | 0.909 | 0.731
GH | 7 | 30 | 30 | 994/665 | 0.626 | 0.904 | 0.74
GH | 8 | 30 | 30 | 1000/653 | 0.632 | 0.898 | 0.742
GH | 9 | 30 | 30 | 999/575 | 0.704 | 0.88 | 0.783
GH | 10 | 30 | 30 | 1000/579 | 0.718 | 0.904 | 0.801
GH | 11 | 30 | 30 | 990/626 | 0.66 | 0.898 | 0.761
GH | 12 | 30 | 30 | 996/626 | 0.655 | 0.891 | 0.755
GH | 13 | 30 | 30 | 988/648 | 0.63 | 0.887 | 0.736
GH | 14 | 30 | 30 | 994/601 | 0.676 | 0.883 | 0.765
GH | 15 | 30 | 30 | 990/611 | 0.661 | 0.878 | 0.754
GH | 16 | 30 | 30 | 992/618 | 0.629 | 0.846 | 0.722
GH | 17 | 30 | 30 | 986/594 | 0.66 | 0.852 | 0.744
GH | 18 | 30 | 30 | 990/592 | 0.632 | 0.813 | 0.711
GH | 19 | 30 | 30 | 988/590 | 0.632 | 0.811 | 0.71
GH | 20 | 30 | 30 | 1000/579 | 0.63 | 0.793 | 0.703
NH | 2 | 30 | 30 | 1000/498 | 0.5 | 0.523 | 0.511
NH | 3 | 30 | 30 | 999/610 | 0.462 | 0.592 | 0.519
NH | 4 | 30 | 30 | 1000/917 | 0.447 | 0.861 | 0.589
NH | 5 | 30 | 30 | 1000/705 | 0.44 | 0.651 | 0.525
NH | 6 | 30 | 30 | 996/625 | 0.509 | 0.668 | 0.578
NH | 7 | 30 | 30 | 994/757 | 0.46 | 0.731 | 0.564
NH | 8 | 30 | 30 | 1000/697 | 0.482 | 0.706 | 0.573
NH | 9 | 30 | 30 | 999/697 | 0.492 | 0.721 | 0.585
NH | 10 | 30 | 30 | 1000/632 | 0.516 | 0.685 | 0.588
NH | 11 | 30 | 30 | 990/757 | 0.448 | 0.712 | 0.55
NH | 12 | 30 | 30 | 996/715 | 0.487 | 0.731 | 0.584
NH | 13 | 30 | 30 | 988/733 | 0.475 | 0.731 | 0.576
NH | 14 | 30 | 30 | 994/745 | 0.454 | 0.71 | 0.554
NH | 15 | 30 | 30 | 990/728 | 0.47 | 0.718 | 0.568
NH | 16 | 30 | 30 | 992/731 | 0.453 | 0.695 | 0.548
NH | 17 | 30 | 30 | 986/720 | 0.454 | 0.687 | 0.547
NH | 18 | 30 | 30 | 990/709 | 0.461 | 0.687 | 0.552
NH | 19 | 30 | 30 | 988/732 | 0.445 | 0.685 | 0.54
NH | 20 | 30 | 30 | 1000/659 | 0.483 | 0.668 | 0.56
Table A2. Regularization parameter λ, ranging from 1 to 35.
Dataset | Regularization λ | Number of Topics k | Number of Seeds | Number of Extracted Words | Precision | Recall | F1-Score
GH | 1 | 10 | 30 | 1000/627 | 0.4 | 0.546 | 0.462
GH | 5 | 10 | 30 | 1000/695 | 0.548 | 0.828 | 0.66
GH | 10 | 10 | 30 | 1000/650 | 0.629 | 0.889 | 0.737
GH | 15 | 10 | 30 | 1000/640 | 0.638 | 0.887 | 0.742
GH | 20 | 10 | 30 | 1000/595 | 0.686 | 0.887 | 0.773
GH | 25 | 10 | 30 | 1000/595 | 0.692 | 0.896 | 0.781
GH | 30 | 10 | 30 | 1000/591 | 0.695 | 0.893 | 0.782
GH | 35 | 10 | 30 | 1000/593 | 0.696 | 0.898 | 0.784
NH | 1 | 10 | 30 | 1000/299 | 0.622 | 0.391 | 0.48
NH | 5 | 10 | 30 | 1000/521 | 0.528 | 0.578 | 0.552
NH | 10 | 10 | 30 | 1000/598 | 0.505 | 0.634 | 0.562
NH | 15 | 10 | 30 | 1000/634 | 0.503 | 0.67 | 0.575
NH | 20 | 10 | 30 | 1000/708 | 0.49 | 0.729 | 0.586
NH | 25 | 10 | 30 | 1000/689 | 0.498 | 0.721 | 0.589
NH | 30 | 10 | 30 | 1000/692 | 0.503 | 0.731 | 0.596
NH | 35 | 10 | 30 | 1000/692 | 0.5 | 0.727 | 0.592
Table A3. Number of extracted top words T, ranging from 50 to 200.
Dataset | Extracted Top Words T | Regularization λ | Number of Topics k | Number of Seeds | Number of Extracted Words | Precision | Recall | F1-Score
GH | 50 | 30 | 10 | 30 | 500/398 | 0.854 | 0.739 | 0.793
GH | 60 | 30 | 10 | 30 | 600/455 | 0.807 | 0.798 | 0.802
GH | 70 | 30 | 10 | 30 | 700/498 | 0.777 | 0.841 | 0.808
GH | 80 | 30 | 10 | 30 | 800/539 | 0.742 | 0.87 | 0.801
GH | 90 | 30 | 10 | 30 | 900/581 | 0.711 | 0.898 | 0.793
GH | 100 | 30 | 10 | 30 | 1000/613 | 0.68 | 0.907 | 0.777
GH | 110 | 30 | 10 | 30 | 1100/642 | 0.659 | 0.92 | 0.768
GH | 120 | 30 | 10 | 30 | 1200/673 | 0.633 | 0.926 | 0.752
GH | 130 | 30 | 10 | 30 | 1300/704 | 0.608 | 0.93 | 0.735
GH | 140 | 30 | 10 | 30 | 1400/746 | 0.582 | 0.943 | 0.72
GH | 150 | 30 | 10 | 30 | 1500/766 | 0.569 | 0.948 | 0.711
GH | 160 | 30 | 10 | 30 | 1600/800 | 0.546 | 0.95 | 0.694
GH | 170 | 30 | 10 | 30 | 1700/826 | 0.531 | 0.954 | 0.683
GH | 180 | 30 | 10 | 30 | 1800/865 | 0.509 | 0.957 | 0.664
GH | 190 | 30 | 10 | 30 | 1900/901 | 0.491 | 0.961 | 0.65
GH | 200 | 30 | 10 | 30 | 2000/936 | 0.473 | 0.963 | 0.635
NH | 50 | 30 | 10 | 30 | 500/464 | 0.506 | 0.494 | 0.5
NH | 60 | 30 | 10 | 30 | 600/496 | 0.538 | 0.561 | 0.549
NH | 70 | 30 | 10 | 30 | 700/514 | 0.535 | 0.578 | 0.556
NH | 80 | 30 | 10 | 30 | 800/577 | 0.51 | 0.618 | 0.558
NH | 90 | 30 | 10 | 30 | 900/634 | 0.494 | 0.658 | 0.564
NH | 100 | 30 | 10 | 30 | 1000/688 | 0.483 | 0.697 | 0.57
NH | 110 | 30 | 10 | 30 | 1100/734 | 0.473 | 0.729 | 0.574
NH | 120 | 30 | 10 | 30 | 1200/776 | 0.464 | 0.756 | 0.575
NH | 130 | 30 | 10 | 30 | 1300/818 | 0.454 | 0.779 | 0.573
NH | 140 | 30 | 10 | 30 | 1400/857 | 0.443 | 0.798 | 0.57
NH | 150 | 30 | 10 | 30 | 1500/813 | 0.458 | 0.782 | 0.577
NH | 160 | 30 | 10 | 30 | 1600/848 | 0.45 | 0.803 | 0.577
NH | 170 | 30 | 10 | 30 | 1700/881 | 0.442 | 0.817 | 0.573
NH | 180 | 30 | 10 | 30 | 1800/912 | 0.431 | 0.826 | 0.566
NH | 190 | 30 | 10 | 30 | 1900/942 | 0.426 | 0.842 | 0.566
NH | 200 | 30 | 10 | 30 | 2000/968 | 0.419 | 0.853 | 0.562
Table A4. Number of seeds, ranging from 20 to 30.
Dataset | Number of Seeds | Regularization λ | Number of Topics k | Number of Extracted Words | Precision | Recall | F1-Score
GH | 20 | 30 | 10 | 1000/624 | 0.631 | 0.857 | 0.727
GH | 21 | 30 | 10 | 1000/598 | 0.664 | 0.863 | 0.75
GH | 22 | 30 | 10 | 1000/613 | 0.635 | 0.846 | 0.725
GH | 23 | 30 | 10 | 1000/627 | 0.632 | 0.861 | 0.729
GH | 24 | 30 | 10 | 1000/611 | 0.668 | 0.887 | 0.762
GH | 25 | 30 | 10 | 1000/556 | 0.718 | 0.867 | 0.785
GH | 26 | 30 | 10 | 1000/583 | 0.69 | 0.874 | 0.771
GH | 27 | 30 | 10 | 1000/625 | 0.659 | 0.896 | 0.759
GH | 28 | 30 | 10 | 1000/577 | 0.719 | 0.902 | 0.8
GH | 29 | 30 | 10 | 1000/614 | 0.679 | 0.907 | 0.777
GH | 30 | 30 | 10 | 1000/613 | 0.68 | 0.907 | 0.777
NH | 20 | 30 | 10 | 1000/691 | 0.47 | 0.683 | 0.557
NH | 21 | 30 | 10 | 1000/728 | 0.442 | 0.676 | 0.535
NH | 22 | 30 | 10 | 1000/712 | 0.426 | 0.637 | 0.51
NH | 23 | 30 | 10 | 1000/714 | 0.431 | 0.647 | 0.518
NH | 24 | 30 | 10 | 1000/656 | 0.485 | 0.668 | 0.562
NH | 25 | 30 | 10 | 1000/762 | 0.419 | 0.67 | 0.515
NH | 26 | 30 | 10 | 1000/744 | 0.458 | 0.716 | 0.559
NH | 27 | 30 | 10 | 1000/658 | 0.473 | 0.653 | 0.549
NH | 28 | 30 | 10 | 1000/771 | 0.454 | 0.735 | 0.561
NH | 29 | 30 | 10 | 1000/832 | 0.431 | 0.754 | 0.549
NH | 30 | 30 | 10 | 1000/632 | 0.516 | 0.685 | 0.588

References

  1. QasemiZadeh, B. Investigating the Use of Distributional Semantic Models for Co-Hyponym Identification in Special Corpora. Ph.D. Thesis, National University of Ireland, Galway, Ireland, 2015. [Google Scholar]
  2. Drouin, P.; Grabar, N.; Hamon, T.; Kageura, K.; Takeuchi, K. Computational terminology and filtering of terminological information: Introduction to the special issue. Terminology 2018, 24, 1–6. [Google Scholar]
  3. Fusco, F.; Staar, P.; Antognini, D. Unsupervised Term Extraction for Highly Technical Domains. arXiv 2022, arXiv:2210.13118. [Google Scholar]
  4. Lang, C.; Wachowiak, L.; Heinisch, B.; Gromann, D. Transforming term extraction: Transformer-based approaches to multilingual term extraction across domains. Find. Assoc. Comput. Linguist. ACL-IJCNLP 2021, 2021, 3607–3620. [Google Scholar]
  5. Terryn, A.R.; Hoste, V.; Lefever, E. HAMLET: Hybrid adaptable machine learning approach to extract terminology. Terminol. Int. J. Theor. Appl. Issues Spec. Commun. 2021, 27, 254–293. [Google Scholar] [CrossRef]
  6. Hazem, A.; Bouhandi, M.; Boudin, F.; Daille, B. Cross-lingual and cross-domain transfer learning for automatic term extraction from low resource data. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 648–662. [Google Scholar]
  7. Vukovic, R.; Heck, M.; Ruppik, B.M.; van Niekerk, C.; Zibrowius, M.; Gašić, M. Dialogue term extraction using transfer learning and topological data analysis. arXiv 2022, arXiv:2208.10448. [Google Scholar]
  8. Qin, Y.; Zheng, D.; Zhao, T.; Zhang, M. Chinese terminology extraction using EM-based transfer learning method. In Computational Linguistics and Intelligent Text Processing, Proceedings of the 14th International Conference, CICLing 2013, Samos, Greece, 24–30 March 2013; Part I; Springer: Berlin/Heidelberg, Germany, 2013; pp. 139–152. [Google Scholar] [CrossRef]
  9. Nugumanova, A.; Akhmed-Zaki, D.; Mansurova, M.; Baiburin, Y.; Maulit, A. NMF-based approach to automatic term extraction. Expert Syst. Appl. 2022, 199, 117179. [Google Scholar] [CrossRef]
  10. Febrissy, M.; Salah, A.; Ailem, M.; Nadif, M. Improving NMF clustering by leveraging contextual relationships among words. Neurocomputing 2022, 495, 105–117. [Google Scholar] [CrossRef]
  11. Lee, D.D.; Seung, H.S. Algorithms for non-negative matrix factorization. In Proceedings of the Neural Information Processing Systems (NIPS), Denver, CO, USA, 1 January 2000; pp. 556–562. [Google Scholar]
  12. Gao, J.; He, D.; Tan, X.; Qin, T.; Wang, L.; Liu, T.Y. Representation degeneration problem in training natural language generation models. arXiv 2019, arXiv:1907.12009. [Google Scholar]
  13. Grootendorst, M. KeyBERT: Minimal keyword extraction with BERT. Zenodo. 2020. Version 0.8.0. Available online: https://zenodo.org/records/8388690 (accessed on 29 April 2024).
  14. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794. [Google Scholar]
  15. Lee, H.; Yoo, J.; Choi, S. Semi-supervised nonnegative matrix factorization. IEEE Signal Process. Lett. 2009, 17, 4–7. [Google Scholar] [CrossRef]
  16. Shen, B.; Makhambetov, O. Hierarchical semi-supervised factorization for learning the semantics. J. Adv. Comput. Intell. Intell. Inform. 2014, 18, 366–374. [Google Scholar] [CrossRef]
  17. Vangara, R.; Skau, E.; Chennupati, G.; Djidjev, H.; Tierney, T.; Smith, J.P.; Bhattarai, M.; Stanev, V.G.; Alexandrov, B.S. Semantic nonnegative matrix factorization with automatic model determination for topic modeling. In Proceedings of the 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 14–17 December 2020; pp. 328–335. [Google Scholar] [CrossRef]
  18. Vangara, R.; Bhattarai, M.; Skau, E.; Chennupati, G.; Djidjev, H.; Tierney, T.; Smith, J.P.; Stanev, V.G.; Alexandrov, B.S. Finding the number of latent topics with semantic non-negative matrix factorization. IEEE Access 2021, 9, 117217–117231. [Google Scholar] [CrossRef]
  19. Eren, M.E.; Solovyev, N.; Bhattarai, M.; Rasmussen, K.Ø.; Nicholas, C.; Alexandrov, B.S. SeNMFk-split: Large corpora topic modeling by semantic non-negative matrix factorization with automatic model selection. In Proceedings of the 22nd ACM Symposium on Document Engineering, San Jose, CA, USA, 20–23 September 2022; pp. 1–4. [Google Scholar] [CrossRef]
  20. Budahazy, R.; Cheng, L.; Huang, Y.; Johnson, A.; Li, P.; Vendrow, J.; Wu, Z.; Molitor, D.; Rebrova, E.; Needell, D. Analysis of Legal Documents via Non-negative Matrix Factorization Methods. arXiv 2021, arXiv:2104.14028. [Google Scholar]
  21. Vendrow, J.; Haddock, J.; Rebrova, E.; Needell, D. On a guided nonnegative matrix factorization. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3265–3269. [Google Scholar] [CrossRef]
  22. Li, P.; Tseng, C.; Zheng, Y.; Chew, J.A.; Huang, L.; Jarman, B.; Needell, D. Guided semi-supervised non-negative matrix factorization. Algorithms 2022, 15, 136. [Google Scholar] [CrossRef]
  23. Kuang, D.; Yun, S.; Park, H. SymNMF: Nonnegative low-rank approximation of a similarity matrix for graph clustering. J. Glob. Optim. 2015, 62, 545–574. [Google Scholar] [CrossRef]
  24. Jia, Y.; Liu, H.; Hou, J.; Kwong, S. Semisupervised adaptive symmetric non-negative matrix factorization. IEEE Trans. Cybern. 2020, 51, 2550–2562. [Google Scholar] [CrossRef] [PubMed]
  25. Jing, L.; Yu, J.; Zeng, T.; Zhu, Y. Semi-supervised clustering via constrained symmetric non-negative matrix factorization. In Proceedings of the Brain Informatics: International Conference, Macau, China, 4–7 December 2012; pp. 309–319. [Google Scholar] [CrossRef]
  26. Gadelrab, F.S.; Haggag, M.H.; Sadek, R.A. Novel semantic tagging detection algorithms based non-negative matrix factorization. SN Appl. Sci. 2020, 2, 54. [Google Scholar] [CrossRef]
  27. Esposito, F. A review on initialization methods for nonnegative matrix factorization: Towards omics data experiments. Mathematics 2021, 9, 1006. [Google Scholar] [CrossRef]
  28. Wild, S.; Curry, J.; Dougherty, A. Improving non-negative matrix factorizations through structured initialization. Pattern Recognit. 2004, 37, 2217–2232. [Google Scholar] [CrossRef]
  29. Nannen, V. The Paradox of Overfitting. Master’s Thesis, Faculty of Science and Engineering, Rijksuniversiteit Groningen, Groningen, The Netherlands, 2003. Available online: https://fse.studenttheses.ub.rug.nl/id/eprint/8664 (accessed on 9 May 2024).
  30. Pascual-Montano, A.; Carazo, J.M.; Lehmann, D.; Pascual-Marqui, R.D. Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 403–415. [Google Scholar] [CrossRef]
  31. Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008. [Google Scholar]
  32. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar] [CrossRef]
  33. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
  34. Lopes, L.; Vieira, R.; Fernandes, P. Domain term relevance through tf-dcf. In Proceedings of the 2012 International Conference on Artificial Intelligence (ICAI), Las Vegas, NV, USA, 16–19 July 2012. [Google Scholar]
Figure 1. The pipeline of the proposed methodology for unsupervised term extraction.
Figure 2. NMF approximates the document–word matrix A as the product of the document–topic matrix W and the topic–word matrix H.
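As an illustration of the decomposition in Figure 2, the following minimal sketch factorizes a TF-IDF document–word matrix with scikit-learn; the toy corpus (in the spirit of the example documents in Table 2) and all parameter values are assumptions of this example, not the exact configuration used in our experiments.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the example documents; any corpus works here.
docs = [
    "Beijing is an ancient Chinese capital near the Great Wall of China.",
    "Shanghai is the main Chinese metropolis, although it is not the capital.",
    "Kyoto, once the capital of Japan, is celebrated for its historic temples.",
    "Tokyo is a bustling metropolis of Japan.",
]

vectorizer = TfidfVectorizer(stop_words="english")
A = vectorizer.fit_transform(docs)           # document–word matrix A (sparse, non-negative)

nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = nmf.fit_transform(A)                     # document–topic matrix W
H = nmf.components_                          # topic–word matrix H
print(W.shape, H.shape)                      # (4, 2) and (2, vocabulary_size)
```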
Figure 3. The scheme for term extraction using the matrix of encoding vectors H.
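The selection step in Figure 3 amounts to sorting each row of the topic–word matrix H and keeping the highest-weighted words. A minimal sketch, continuing the previous example, is given below; the helper name top_terms_per_topic is purely illustrative.

```python
import numpy as np

def top_terms_per_topic(H, feature_names, top_n=5):
    """Return the top_n highest-weighted words for each row (topic) of H."""
    top_terms = []
    for row in H:
        best = np.argsort(row)[::-1][:top_n]   # indices of the largest coefficients
        top_terms.append([feature_names[i] for i in best])
    return top_terms

# Continuing the previous sketch:
# feature_names = vectorizer.get_feature_names_out()
# print(top_terms_per_topic(H, feature_names, top_n=5))
```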
Figure 4. The document–word matrix with reduced features.
Figure 5. The topic–word matrix, after transposition, is organized as follows: (a) sorted in descending order by coefficients in the Topic 1 column; (b) sorted in descending order by coefficients in the Topic 2 column.
Figure 6. Semantic NMF jointly approximates the document–word matrix A and the word–feature matrix M, sharing a common factor matrix H.
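For readers who prefer an algebraic view of Figure 6, the joint approximation can be written as a single regularized objective. The formulation below is an illustrative sketch in our notation and may differ in detail from the exact objective optimized in our implementation:

```latex
\min_{W \ge 0,\; H \ge 0,\; S \ge 0}
  \left\lVert A - W H \right\rVert_F^{2}
  \;+\; \lambda \left\lVert M - H^{\top} S \right\rVert_F^{2}
```

Here A is the d × n document–word matrix, W the d × k document–topic matrix, H the k × n shared topic–word matrix, M the n × s word–feature matrix, S an auxiliary k × s factor, and λ balances the two reconstruction terms. Taking M as the full word–word similarity matrix (Figure 7) gives the usual semantic NMF setting, while restricting its columns to seed words (Figure 9) yields the word–seed variant; in both cases the factors are typically obtained by alternating non-negative updates.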
Figure 7. The word–word matrix M, representing similarities between words in the FastText embedding space.
Figure 8. The semantically enriched topic–word matrix, after transposition, is organized as follows: (a) sorted in descending order by coefficients in the Topic 1 column; (b) sorted in descending order by coefficients in the Topic 2 column.
Figure 9. The reduced word–seed matrix M, representing cosine similarities between all words and seed words.
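A word–seed matrix such as the one in Figure 9 can be assembled from any word-embedding lookup. The sketch below assumes a callable get_vector (for example, a FastText model's vector lookup) and clips negative cosine similarities to zero so that the resulting matrix stays non-negative; both the helper name and the clipping choice are assumptions of this illustration rather than a description of our exact pipeline.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_word_seed_matrix(vocab, seeds, get_vector):
    """Build a word–seed similarity matrix M of shape (len(vocab), len(seeds)).

    `get_vector` is any callable mapping a word to its embedding,
    e.g. a FastText lookup; it is a placeholder used only in this sketch.
    """
    word_vecs = np.vstack([get_vector(w) for w in vocab])
    seed_vecs = np.vstack([get_vector(s) for s in seeds])
    M = cosine_similarity(word_vecs, seed_vecs)
    # Joint factorization requires non-negative inputs, so clip negatives to zero.
    return np.clip(M, 0.0, None)
```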
Figure 10. Results of the semantic NMF using the word–seed matrix: (a) the word–seed matrix M; (b) the top 5 words for Topic 1; (c) the top 5 words for Topic 2.
Figure 11. Example of an expert’s request to ChatGPT-4 for the GH dataset.
Figure 12. Example of an expert’s request to ChatGPT-4 for the NH dataset.
Figure 13. (a) The term cloud for the GH dataset; (b) the term cloud for the NH dataset.
Figure 14. (a) Varying the number of topics k for GH; (b) varying the number of topics k for NH. The blue line shows the F1-score for each number of topics k, while the red reference line marks the maximum F1-score observed.
Figure 15. (a) Varying the regularization weight λ for GH; (b) varying the regularization weight λ for NH.
Figure 16. Comparison of the best F1-scores.
Table 1. Original data from [31]: Document Set for Binary Classification.
Set | docID | Words in Document | In c = China?
Training set | 1 | Chinese Beijing Chinese | Yes
Training set | 2 | Chinese Chinese Shanghai | Yes
Training set | 3 | Chinese Macao | Yes
Training set | 4 | Tokyo Japan Chinese | No
Test set | 5 | Chinese Chinese Chinese Tokyo Japan | ?
Table 2. A fragment of the example corpus.
No. | Document
1 | Beijing, an ancient Chinese capital, is located 75 km away from the Great Wall of China.
2 | Although Shanghai is not the capital of China, it is the main Chinese metropolis.
3 | Macau, an autonomous territory, uniquely blends its rich cultural heritage within China, showcasing a distinct yet harmonious identity.
4 | Nanjing, being a significant center for Buddhism in China, contributed to the spread of Buddhism to Japan.
5 | Guangzhou, a major city in Southern China, is famed for its modern architecture and rich Cantonese heritage.
6 | Over centuries, China and Japan have had extensive cultural exchanges.
7 | Hokkaido, the northern island of Japan, is renowned for its stunning landscapes.
8 | In Tokyo, the technology sector, known for its innovation and advancement, collaborates with Chinese expertise, contributing to the shared progress of Japan and its global partners.
9 | Yokohama is the second largest city in Japan after Tokyo, with a population of 3.7 million people.
10 | Yokohama Chinatown, notable for being the largest Chinese enclave in Japan, boasts a wide array of Chinese restaurants known for their exquisite cuisine.
11 | Kyoto, once the capital of Japan, is celebrated for its historic temples and traditional culture.
12 | Kyoto, the ancient capital of Japan before Tokyo, is renowned for its rich historical and cultural heritage.
13 | Many temples and buildings in Kyoto were influenced by Chinese architecture.
14 | Tokyo is a bustling metropolis, while Kyoto is a serene historical haven, both embodying the diverse charm of Japan.
Table 3. Examples of sentences from both datasets.
Country | GH Dataset | NH Dataset
China | Macau, an autonomous territory, uniquely blends its rich cultural heritage within China, showcasing a distinct yet harmonious identity. | During Confucius's lifetime, he traveled across various Chinese states, offering advice to rulers, which later formed the basis of Confucianism.
Egypt | Lake Nasser was created by the construction of the Aswan High Dam across the Nile River in southern Egypt in the 1960s. | Osiris was one of the most important gods in ancient Egyptian religion, associated with kingship, death, and the afterlife.
Greece | Greece is known as the cradle of Western civilization, with a rich history that spans thousands of years. | In Athens, the philosopher Socrates challenged traditional notions of ethics and wisdom, sparking intellectual revolutions across Greece.
Iran | Tehran serves as the political, cultural, economic, and industrial center of Iran. | Omar Khayyam's contributions to mathematics and astronomy were highly regarded in cities like Nishapur, Isfahan, and Baghdad, where he studied and worked.
Japan | Yokohama is the second largest city in Japan after Tokyo, with a population of 3.7 million people. | Miyazaki's love for nature is evident in many of his films, drawing inspiration from the lush forests of Yakushima Island.
Kazakhstan | The Irtysh is one of the longest rivers in Asia, flowing through China, Kazakhstan, and Russia. | Abai, the great Kazakh poet, was born in the village located in the Semey region of Kazakhstan.
Mongolia | The Gandan Monastery, located in Ulaanbaatar, is one of Mongolia's most important Buddhist monasteries and a center for religious and cultural activities. | Subutai's military genius made him one of the most feared and respected commanders of his time, leaving a legacy in Mongol and military history.
Russia | The Volga River is the longest river in Europe, flowing through central Russia for over 3500 km. | Turgenev's masterpiece, "Fathers and Sons", reflects the tensions between generations and the changing social landscape of 19th-century Russia.
Turkey | The capital and second-largest city of Turkey is Ankara. | Sultan Suleiman was born in 1494 in Trabzon, a city located on the northeastern coast of Turkey.
Uzbekistan | Khiva is an ancient city located in the western part of Uzbekistan, in the region of Khorezm. | Khiva, located along the Silk Road, was a hub of trade and exchange, influencing Al-Khwarizmi's understanding of geography and navigation.
Table 4. Datasets statistics.
Country | Number of Sentences (GH) | Average Sentence Length (GH) | Number of Term Occurrences (GH) | Number of Unique Terms (GH) | Number of Sentences (NH) | Average Sentence Length (NH) | Number of Term Occurrences (NH) | Number of Unique Terms (NH)
China | 100 | 20 | 302 | 85 | 100 | 20 | 166 | 40
Egypt | 100 | 21 | 294 | 58 | 100 | 22 | 118 | 36
Greece | 100 | 21 | 283 | 65 | 100 | 22 | 123 | 54
Iran | 100 | 17 | 281 | 66 | 100 | 22 | 171 | 63
Japan | 100 | 20 | 235 | 46 | 100 | 19 | 131 | 35
Kazakhstan | 100 | 17 | 390 | 62 | 100 | 22 | 143 | 59
Mongolia | 100 | 18 | 360 | 59 | 100 | 23 | 174 | 57
Russia | 100 | 26 | 295 | 72 | 100 | 21 | 152 | 58
Turkey | 100 | 17 | 271 | 57 | 100 | 20 | 161 | 46
Uzbekistan | 100 | 23 | 436 | 83 | 100 | 22 | 226 | 47
Total | 1000 | - | 3147 | 653/461 | 1000 | - | 1183 | 495/476
Note: The last row ('Total') and its numbers are shown in bold to highlight summary statistics.
Table 5. The top 10 words for the GH and NH datasets.
No. | Word (GH) | Frequency (GH) | Word (NH) | Frequency (NH)
1 | City | 136 | Born | 159
2 | Located | 119 | City | 116
3 | Known | 115 | Ancient | 99
4 | River | 102 | One | 81
5 | Ancient | 99 | Known | 71
6 | One | 93 | Located | 58
7 | Region | 90 | Around | 52
8 | Sea | 90 | Chinese | 51
9 | Cultural | 88 | Legacy | 48
10 | Including | 79 | Kazakh | 48
Table 6. Seeds for semantic NMF.
Class (Country) | 2 Seed Words per Class (GH) | 3 Seed Words per Class (GH) | 2 Seed Words per Class (NH) | 3 Seed Words per Class (NH)
China | China, Beijing | China, Beijing, Shanghai | Confucius, Mao | Confucius, Mao, Laozi
Egypt | Egypt, Cairo | Egypt, Cairo, Alexandria | Cleopatra, Osiris | Cleopatra, Osiris, Imhotep
Greece | Greece, Athens | Greece, Athens, Olympus | Zeus, Socrates | Zeus, Socrates, Aristotle
Iran | Iran, Isfahan | Iran, Tehran, Isfahan | Omar, Avicenna | Omar, Avicenna, Hafez
Japan | Japan, Tokyo | Japan, Kyoto, Tokyo | Toyoda, Hokusai | Toyoda, Hokusai, Hanyu
Kazakhstan | Kazakh, Almaty | Kazakh, Almaty, Astana | Abai, al-Farabi | Abai, al-Farabi, Tomyris
Mongolia | Mongolia, Ulaanbaatar | Mongolia, Ulaanbaatar, Gobi | Chingis, Chagatai | Chingis, Chagatai, Kublai
Russia | Russia, Moscow | Russia, Moscow, Petersburg | Pushkin, Gagarin | Pushkin, Gagarin, Lomonosov
Turkey | Turkey, Istanbul | Turkey, Istanbul, Ankara | Suleiman, Erdogan | Suleiman, Erdogan, Ataturk
Uzbekistan | Uzbek, Tashkent | Uzbek, Tashkent, Samarkand | al-Khwarizmi, Babur | al-Khwarizmi, Babur, Navoi
Table 7. Overview of experimental setups for semantic NMF.
Setup | Number of Topics k | Regularization λ | Number of Extracted Top Words T | Number of Seeds | Use of Anti-Seeds
1 | 10 | 30 | 70 | 20 | No
2 | 10 | 30 | 70 | 30 | No
3 | 10 | 30 | 70 | 20 | Yes
4 | 10 | 30 | 70 | 30 | Yes
Table 8. Results of baseline methods.
Dataset | Baseline Method | Number of Extracted Words | Number of True Terms | Precision | Recall | F1-Score
GH | Standard NMF | 700/440 | 462 | 29.8 | 28.5 | 29.1
GH | BERTopic | 700/495 | 462 | 42.6 | 45.9 | 44.2
GH | KeyBERT | 2000/633 | 462 | 52 | 71.5 | 60.2
NH | Standard NMF | 700/488 | 476 | 19.1 | 19.5 | 19.3
NH | BERTopic | 700/568 | 476 | 25.7 | 30.7 | 28
NH | KeyBERT | 2000/853 | 476 | 44.9 | 80.5 | 57.6
Note: Best results are highlighted in bold.
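The precision, recall, and F1 values reported in Tables 8–10 are set-level scores of the extracted words against the manually annotated term lists. A minimal sketch of such an evaluation is shown below; the normalization applied before matching (e.g., lower-casing and lemmatization) is an assumption of this example, not necessarily the exact procedure behind the reported numbers.

```python
def precision_recall_f1(extracted, gold_terms):
    """Score a collection of extracted words against a gold term list."""
    extracted, gold = set(extracted), set(gold_terms)
    tp = len(extracted & gold)                       # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Example: precision_recall_f1(["beijing", "temple"], ["beijing", "temple", "dynasty"])
# returns (1.0, 0.666..., 0.8)
```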
Table 9. Results for each experimental setup of semantic NMF.
Dataset | Setup | Number of Extracted Words | Number of True Terms | Precision | Recall | F1-Score
GH | 1 (20 + no) | 700/406 | 462 | 77.8 | 68.7 | 73
GH | 2 (30 + no) | 700/459 | 462 | 80 | 79.8 | 79.9
GH | 3 (20 + yes) | 700/371 | 462 | 84.4 | 68 | 75.3
GH | 4 (30 + yes) | 700/453 | 462 | 80.8 | 79.6 | 80.2
NH | 1 (20 + no) | 700/585 | 476 | 46.7 | 57.4 | 51.5
NH | 2 (30 + no) | 700/609 | 476 | 48.1 | 61.6 | 54
NH | 3 (20 + yes) | 700/526 | 476 | 51.9 | 57.4 | 54.5
NH | 4 (30 + yes) | 700/551 | 476 | 52.5 | 60.7 | 56.3
Note: Best results are highlighted in bold.
Table 10. The best results of semantic NMF compared to the best results of KeyBERT.
Approach | Precision (GH) | Recall (GH) | F1-Score (GH) | Precision (NH) | Recall (NH) | F1-Score (NH)
Semantic NMF | 80.8 | 79.6 | 80.2 | 47.4 | 77.7 | 58.9
KeyBERT | 52 | 71.5 | 60.2 | 44.9 | 80.5 | 57.6
Standard NMF | 29.8 | 28.5 | 29.1 | 17.2 | 48.9 | 25.5
BERTopic | 42.6 | 45.9 | 44.2 | 22.3 | 49.2 | 30.7
Note: Best F1-Scores are highlighted in bold.