MDPI - Publisher of Open Access Journals

15 pages, 1346 KB

Open Access | Article

Using Social Media Listening to Characterize the Flare Lexicon in Patients with Sjögren’s Disease

by Chiara Baldini, Maurice Flurie, Zachary Cline, Colton Flowers, Coralie Peter Bouillot, Linda J. Stone, Lauren Dougherty, Christopher DeFelice and Maria Picone

Rheumato 2025, 5(4), 14; https://doi.org/10.3390/rheumato5040014 - 26 Sep 2025

Viewed by 318

Abstract

Background/Objectives: Sjögren’s disease (SjD) flares are incompletely understood. The patient perspective is critical to closing this gap. This retrospective social media listening (SML) study characterized the flare lexicon within the online Reddit SjD community using novel machine learning and natural language processing. Methods: Documents (posts/comments) were analyzed from the subreddit group “r/Sjogrens” (October 2012 to August 2023). Outcomes were as follows: (1) Frequency of documents mentioning flare, and contexts in which flare was mentioned; (2) clinical concepts associated with flare (analyzed using co-occurrence and pointwise mutual information [PMI]); (3) proportion of flare vs. non-flare documents relevant to SYMPTOMS or TESTING (compared using a two-proportion z-test); and (4) primary emotions mentioned in flare documents. Results: Of 59,266 documents with 5025 authors, flare was mentioned 3330 times (4.4% of documents from 19.1% of authors). Flare was discussed as a symptom (1423 instances), disease (13), or with no clinical category (1890). Flare-associated clinical concepts (co-occurrence > 100 and PMI² > 3) included SYMPTOMS (pain, fatigue, dryness of eye, xerostomia, arthralgia, stress) and BODY PARTS (eye, mouth, joints, whole body). More flare vs. non-flare documents mentioned a SYMPTOM, whereas fewer mentioned a TEST (p < 0.001 for both). Within flare documents, 36.5% expressed emotions, primarily fear (40.5% of primary emotions), happiness (17.8%), sadness (15.7%), and anger (15.5%). Conclusions: The SjD community discusses flare frequently and in context with symptoms, specifically pain, eye and mouth dryness, and fatigue. Flare conversations frequently involve negative emotions. Additional research is required to clarify the patient experience of flare, its clinical parameters, and implications. Full article
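
The two-proportion z-test named in outcome (3) can be sketched as follows. This is a minimal illustration using the standard pooled-variance form of the test; the function name and the counts in the usage line are hypothetical and are not taken from the study.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: both groups share one proportion.

    x1, x2: number of documents in each group that mention the concept.
    n1, n2: total number of documents in each group.
    Assumes the pooled proportion is strictly between 0 and 1.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Standard error under the null hypothesis of equal proportions.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, for illustration only (not the study's data).
z, p = two_proportion_z_test(60, 100, 40, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A positive z here would mean the first group (for example, flare documents) mentions the concept proportionally more often than the second.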

13 pages, 282 KB

Open Access | Feature Paper | Article

Information Exchange Fluctuation Theorem Under Coarse-Graining

by Lee Jinwoo

Mathematics 2025, 13(16), 2607; https://doi.org/10.3390/math13162607 - 14 Aug 2025

Viewed by 314

Abstract

The fluctuation theorem for information exchange, originally established by Sagawa and Ueda, provides a fundamental framework for understanding the role of correlations in coupled classical stochastic systems. Building upon this foundation, Jinwoo demonstrated that the pointwise mutual information between correlated subsystems captures entropy production as a state function during coupling processes. In this study, we investigate the robustness of this information-theoretic fluctuation theorem under coarse-graining in coupled classical fluctuating systems. We rigorously prove that the fluctuation theorem remains invariant under arbitrary coarse-graining transformations and derive hierarchical relationships between information measures across different scales, thereby establishing its fundamental character as independent of the level of system description. Our results demonstrate that the relationship between information exchange and entropy production is preserved across different scales of observation, providing deeper insights into the thermodynamic foundations of information processing in classical stochastic systems. Full article

20 pages, 547 KB

Open Access | Article

Fine-Grained Semantics-Enhanced Graph Neural Network Model for Person-Job Fit

by Xia Xue, Jingwen Wang, Bo Ma, Jing Ren, Wujie Zhang, Shuling Gao, Miao Tian, Yue Chang, Chunhong Wang and Hongyu Wang

Entropy 2025, 27(7), 703; https://doi.org/10.3390/e27070703 - 30 Jun 2025

Viewed by 733

Abstract

Online recruitment platforms are transforming talent acquisition paradigms, where a precise person-job fit plays a pivotal role in intelligent recruitment systems. However, current methodologies predominantly rely on coarse-grained semantic analysis, failing to address the textual structural dependencies and noise inherent in resumes and job descriptions. To bridge this gap, the novel fine-grained semantics-enhanced graph neural network for person-job fit (FSEGNN-PJF) framework is proposed. First, graph topologies are constructed by modeling word co-occurrence relationships through pointwise mutual information and sliding windows, followed by graph attention networks to learn graph structural semantics. Second, to mitigate textual noise and focus on critical features, a differential transformer and self-attention mechanism are introduced to semantically encode resumes and job requirements. Then, a novel fine-grained semantic matching strategy is designed, using the enhanced feature fusion strategy to fuse the semantic features of resumes and job positions. Extensive experiments on real-world recruitment datasets demonstrate the effectiveness and robustness of FSEGNN-PJF. Full article

(This article belongs to the Section Multidisciplinary Applications)
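
The graph-construction step described in the abstract above (word co-occurrence modeled through pointwise mutual information and sliding windows) can be sketched roughly as follows. This is not the FSEGNN-PJF implementation: the function name, the default window size, and the choice to keep only edges with positive PMI are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window_size=3):
    """Build weighted word-word edges from sliding-window co-occurrence.

    docs: list of token lists. Returns {(word_a, word_b): pmi} with
    word_a < word_b, keeping only pairs whose PMI is positive
    (an assumed filtering rule, not taken from the paper).
    """
    word_windows = Counter()   # number of windows containing each word
    pair_windows = Counter()   # number of windows containing both words
    n_windows = 0
    for tokens in docs:
        if not tokens:
            continue
        # A document shorter than the window counts as a single window.
        last_start = max(len(tokens) - window_size, 0)
        for start in range(last_start + 1):
            window = sorted(set(tokens[start:start + window_size]))
            n_windows += 1
            word_windows.update(window)
            pair_windows.update(combinations(window, 2))
    edges = {}
    for (a, b), n_ab in pair_windows.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ), probabilities over windows.
        pmi = math.log(n_ab * n_windows / (word_windows[a] * word_windows[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges

# Toy example: one tokenized "document".
print(pmi_edges([["python", "developer", "python", "engineer"]], window_size=2))
```

The resulting edge weights could then serve as the adjacency of a word graph; how the paper feeds that graph into its attention layers is outside the scope of this sketch.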

29 pages, 2368 KB

Open Access | Article

Chinese “Dialects” and European “Languages”: A Comparison of Lexico-Phonetic and Syntactic Distances

by Chaoju Tang, Vincent J. van Heuven, Wilbert Heeringa and Charlotte Gooskens

Languages 2025, 10(6), 127; https://doi.org/10.3390/languages10060127 - 29 May 2025

Cited by 1 | Viewed by 4129

Abstract

In this article, we tested some specific claims made in the literature on relative distances among European languages and among Chinese dialects, suggesting that some language varieties within the Sinitic family traditionally called dialects are, in fact, more linguistically distant from one another than some European varieties that are traditionally called languages. More generally, we examined whether distances among varieties within and across European language families were larger than those within and across Sinitic language varieties. To this end, we computed lexico-phonetic as well as syntactic distance measures for comparable language materials in six Germanic, five Romance and six Slavic languages, as well as for six Mandarin and nine non-Mandarin (‘southern’) Chinese varieties. Lexico-phonetic distances were expressed as the length-normalized PMI-weighted Levenshtein distances computed on the 100 most frequently used nouns in the 32 language varieties. Syntactic distance was implemented as the complement of the Pearson correlation coefficient found for the PoS trigram frequencies established for a parallel corpus of the same four texts translated into each of the 32 languages. The lexico-phonetic distances proved to be relatively large and of approximately equal magnitude in the Germanic, Slavic and non-Mandarin Chinese language varieties. The lexico-phonetic distances among the Romance and among the Mandarin varieties, however, were considerably smaller, and similar in magnitude to each other. Cantonese (Guangzhou dialect) was lexico-phonetically as distant from Standard Mandarin (Beijing dialect) as European language pairs such as Portuguese–Italian, Portuguese–Romanian and Dutch–German. Syntactically, however, the differences among the Sinitic varieties were about ten times smaller than the differences among the European languages, both within and across the families, which provides some justification for the Chinese tradition of calling the Sinitic varieties dialects of the same language. Full article

(This article belongs to the Special Issue Dialectal Dynamics)
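
As a rough illustration of the lexico-phonetic measure mentioned in the abstract above, the sketch below computes a length-normalized Levenshtein distance. It deliberately uses plain unit costs rather than the weighted costs the abstract describes, and it normalizes by the length of the longer string; both choices are simplifying assumptions, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def normalized_distance(a, b):
    """Levenshtein distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

# Toy example on orthographic strings; real use would compare phonetic transcriptions.
print(normalized_distance("kitten", "sitting"))
```

Averaging this value over a shared word list would give one distance per pair of varieties; a weighted variant would replace the 0/1 substitution cost with a segment-pair cost.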

22 pages, 3887 KB

Open Access | Article

The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets

by Fernando Henrique Calderón Alvarado

Appl. Sci. 2025, 15(7), 3490; https://doi.org/10.3390/app15073490 - 22 Mar 2025

Viewed by 948

Abstract

This study examines the role of linguistic regional variations in synthetic dataset generation and their impact on emotion detection performance. Emotion detection is essential for natural language processing (NLP) applications such as social media analysis, customer service, and mental health monitoring. To explore this, synthetic datasets were generated using a state-of-the-art language model, incorporating English variations from the United States, United Kingdom, and India, alongside a general baseline dataset. Two levels of prompt specificity were employed to assess the influence of regional linguistic nuances. Statistical analyses—including frequency distribution, term frequency-inverse document frequency (TF-IDF), type–token ratio (TTR), hapax legomena, pointwise mutual information (PMI) scores, and key-phrase extraction—revealed significant linguistic diversity and regional distinctions in the generated datasets. To evaluate their effectiveness, classification experiments were conducted with two models using bidirectional encoder representations from transformers (BERT) and its de-noising sequence to sequence variation (BART), beginning with zero-shot classification on the contextualized affect representations for emotion recognition (CARER) dataset, followed by fine-tuning with both baseline and region-specific datasets. Results demonstrated that region-specific datasets, particularly those generated with detailed prompts, significantly improved classification accuracy compared to the baseline. These findings underscore the importance of incorporating global linguistic variations in synthetic dataset generation, offering insights into how regional adaptations can enhance emotion detection models for diverse NLP applications. Full article

(This article belongs to the Special Issue Application of Affective Computing)

17 pages, 3763 KB

Open Access | Article

Graph-Based Feature Crossing to Enhance Recommender Systems

by Congyu Cai, Hong Chen, Yunxuan Liu, Daoquan Chen, Xiuze Zhou and Yuanguo Lin

Mathematics 2025, 13(2), 302; https://doi.org/10.3390/math13020302 - 18 Jan 2025

Cited by 2 | Viewed by 1725

Abstract

In recommendation tasks, most existing models that learn users’ preferences from user–item interactions ignore the relationships between items. Additionally, ensuring that the crossed features capture both global graph structures and local context is non-trivial, requiring innovative techniques for multi-scale representation learning. To overcome these difficulties, we develop a novel neural network, CoGraph, which uses a graph to build the relations between items. The item co-occurrence pattern assumes that certain items consistently appear in pairs in users’ viewing or consumption logs. First, to learn relationships between items, a graph whose distance is measured by Normalised Point-Wise Mutual Information (NPMI) is applied to link items according to the co-occurrence pattern. Then, to learn as many useful features as possible for higher recommendation quality, a Convolutional Neural Network (CNN) and the Transformer model are used to learn local and global feature interactions in parallel. Finally, a series of comprehensive experiments was conducted on several public data sets to show the performance of our model. The results provide valuable insights into the capability of our model in recommendation tasks and offer a viable pathway for the public data operation. Full article

(This article belongs to the Special Issue Advanced Research in Data-Centric AI)
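
The NPMI measure used in the abstract above to link co-occurring items can be sketched from raw counts as follows. This follows the common definition NPMI = PMI / (−log p(x, y)), which ranges from −1 to 1; the function name, the count-based interface, and the handling of the two edge cases are assumptions for illustration and are not taken from CoGraph.

```python
import math

def npmi(n_xy, n_x, n_y, n_total):
    """Normalized PMI of two items from co-occurrence counts.

    n_xy: number of logs containing both items.
    n_x, n_y: number of logs containing each item.
    n_total: total number of logs.
    Returns a value in [-1, 1]: -1 never co-occur, 0 independent, 1 always together.
    """
    if n_xy == 0:
        return -1.0  # limiting value when the items never co-occur
    p_xy = n_xy / n_total
    if p_xy == 1.0:
        return 0.0   # 0/0 case; returning 0.0 is an assumed convention
    pmi = math.log(p_xy / ((n_x / n_total) * (n_y / n_total)))
    return pmi / -math.log(p_xy)

# Hypothetical counts: two items seen together in 10 of 100 user logs.
print(npmi(10, 10, 10, 100))
```

Because the score is bounded, a fixed threshold on it can be used to decide which item pairs receive an edge, independent of how frequent the items are.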

13 pages, 270 KB

Open Access | Article

Exploring the Effects of Pre-Processing Techniques on Topic Modeling of an Arabic News Article Data Set

by Haya Alangari and Nahlah Algethami

Appl. Sci. 2024, 14(23), 11350; https://doi.org/10.3390/app142311350 - 5 Dec 2024

Cited by 2 | Viewed by 1857

Abstract

This research investigates the impacts of pre-processing techniques on the effectiveness of topic modeling algorithms for Arabic texts, focusing on a comparison between BERTopic, Latent Dirichlet Allocation (LDA), and Non-Negative Matrix Factorization (NMF). Using the Single-label Arabic News Article Data set (SANAD), which includes 195,174 Arabic news articles, this study explores pre-processing methods such as cleaning, stemming, normalization, and stop word removal, which are crucial processes given the complex morphology of Arabic. Additionally, the influence of six different embedding models on the topic modeling performance was assessed. The originality of this work lies in addressing the lack of previous studies that optimize BERTopic through adjusting the n-gram range parameter and combining it with different embedding models for effective Arabic topic modeling. Pre-processing techniques were fine-tuned to improve data quality before applying BERTopic, LDA, and NMF, and the performance was assessed using metrics such as topic coherence and diversity. Coherence was measured using Normalized Pointwise Mutual Information (NPMI). The results show that the Tashaphyne stemmer significantly enhanced the performance of LDA and NMF. BERTopic, optimized with pre-processing and bi-grams, outperformed LDA and NMF in both coherence and diversity. The CAMeL-Lab/bert-base-arabic-camelbert-da embedding yielded the best results, emphasizing the importance of pre-processing in Arabic topic modeling. Full article

27 pages, 3781 KB

Open Access | Article

Spectral Clustering Community Detection Algorithm Based on Point-Wise Mutual Information Graph Kernel

by Yinan Chen, Wenbin Ye and Dong Li

Entropy 2023, 25(12), 1617; https://doi.org/10.3390/e25121617 - 3 Dec 2023

Cited by 2 | Viewed by 3331

Abstract

To address the problem that traditional spectral clustering algorithms cannot obtain the complete structural information of networks, this paper proposes a spectral clustering community detection algorithm, PMIK-SC, based on the point-wise mutual information (PMI) graph kernel. The kernel is constructed according to the point-wise mutual information between nodes, which is then used as a proximity matrix to reconstruct the network and obtain the symmetric normalized Laplacian matrix. Finally, the network is partitioned by the eigendecomposition and eigenvector clustering of the Laplacian matrix. In addition, to determine the number of clusters during spectral clustering, this paper proposes a fast algorithm, BI-CNE, for estimating the number of communities. For a specific network, the algorithm first reconstructs the original network and then runs Monte Carlo sampling to estimate the number of communities by Bayesian inference. Experimental results show that the detection speed and accuracy of the algorithm are superior to other existing algorithms for estimating the number of communities. On this basis, the spectral clustering community detection algorithm PMIK-SC also has high accuracy and stability compared with other community detection algorithms and spectral clustering algorithms. Full article

(This article belongs to the Special Issue Community Detection and Clustering Complex Networks and Their Applications)

27 pages, 491 KB

Open Access | Article

Decomposing and Tracing Mutual Information by Quantifying Reachable Decision Regions

by Tobias Mages and Christian Rohner

Entropy 2023, 25(7), 1014; https://doi.org/10.3390/e25071014 - 30 Jun 2023

Cited by 1 | Viewed by 2377

Abstract

The idea of a partial information decomposition (PID) gained significant attention for attributing the components of mutual information from multiple variables about a target to being unique, redundant/shared or synergetic. Since the original measure for this analysis was criticized, several alternatives have been proposed but have failed to satisfy the desired axioms, an inclusion–exclusion principle or have resulted in negative partial information components. For constructing a measure, we interpret the achievable type I/II error pairs for predicting each state of a target variable (reachable decision regions) as notions of pointwise uncertainty. For this representation of uncertainty, we construct a distributive lattice with mutual information as consistent valuation and obtain an algebra for the constructed measure. The resulting definition satisfies the original axioms, an inclusion–exclusion principle and provides a non-negative decomposition for an arbitrary number of variables. We demonstrate practical applications of this approach by tracing the flow of information through Markov chains. This can be used to model and analyze the flow of information in communication networks or data processing systems. Full article

(This article belongs to the Special Issue Synergy and Redundancy Measures: Theory and Applications to Characterize Complex Systems and Shape Neural Network Representations)

18 pages, 2905 KB

Open Access | Article

Dynamic Characteristics and Evolution Analysis of Information Dissemination Theme of Social Networks under Emergencies

by Yuan Zhang, Yanxi Xie, Victor Shi and Ke Yin

Behav. Sci. 2023, 13(4), 282; https://doi.org/10.3390/bs13040282 - 24 Mar 2023

Cited by 9 | Viewed by 2685

Abstract

Social media has become an essential channel for the public to create and obtain information during emergencies. As the theme of public concern for emergencies changes over time, there is a lack of research on its dynamic evolution from its latent stage. This paper selects the Henan rainstorm event as a case study and extracts the theme characteristics by combining the life cycle theory and Latent Dirichlet Allocation (LDA) model. It integrates the Term Frequency–Inverse Document Frequency (TF-IDF) and Pointwise Mutual Information (PMI) algorithms as the theme-coding data source to build a dynamic theme propagation model for emergencies. Our research results showed that the theme coding effectively verified the assumption of latent development trends. The dynamic theme model could reveal the theme characteristics of different time series stages of emergencies, analyze the law of the theme evolution of the network’s public opinion, and provide practical and theoretical insights for the emergency management of urban cities. Full article

17 pages, 1464 KB

Open Access | Article

Modeling Topics in DFA-Based Lemmatized Gujarati Text

by Uttam Chauhan, Shrusti Shah, Dharati Shiroya, Dipti Solanki, Zeel Patel, Jitendra Bhatia, Sudeep Tanwar, Ravi Sharma, Verdes Marina and Maria Simona Raboaca

Sensors 2023, 23(5), 2708; https://doi.org/10.3390/s23052708 - 1 Mar 2023

Cited by 6 | Viewed by 2709

Abstract

Topic modeling is a statistical, unsupervised machine learning technique that maps a high-dimensional corpus to a low-dimensional topical subspace, but it leaves room for improvement. A topic in a topic model is expected to be interpretable as a concept, i.e., to correspond to a human understanding of a topic occurring in texts. While discovering corpus themes, inference relies on the vocabulary, whose size affects topic quality, and the corpus contains many inflectional forms. Since words that frequently appear in the same sentence are likely to share a latent topic, practically all topic models rely on co-occurrence signals between terms in the corpus. Topics become weaker because of the abundance of distinct tokens in languages with extensive inflectional morphology. Lemmatization is often used to preempt this problem. Gujarati is a morphologically rich language in which a word may have several inflectional forms. This paper proposes a deterministic finite automaton (DFA)-based lemmatization technique for the Gujarati language that transforms inflected forms into their root words. The set of topics is then inferred from this lemmatized corpus of Gujarati text. We employ statistical divergence measurements to identify semantically less coherent (overly general) topics. The results show that the lemmatized Gujarati corpus yields more interpretable and meaningful topics than unlemmatized text. Finally, the results show that lemmatization reduces the vocabulary size by 16% and changes the semantic coherence on all three measures (Log Conditional Probability, Pointwise Mutual Information, and Normalized Pointwise Mutual Information) from −9.39 to −7.49, −6.79 to −5.18, and −0.23 to −0.17, respectively. Full article

(This article belongs to the Special Issue Application of Semantic Technologies in Sensors and Sensing Systems)

15 pages, 878 KB

Open Access | Editor’s Choice | Article

CoNet: Efficient Network Regression for Survival Analysis in Transcriptome-Wide Association Studies—With Applications to Studies of Breast Cancer

by Jiayi Han, Liye Zhang, Ran Yan, Tao Ju, Xiuyuan Jin, Shukang Wang, Zhongshang Yuan and Jiadong Ji

Genes 2023, 14(3), 586; https://doi.org/10.3390/genes14030586 - 25 Feb 2023

Viewed by 2190

Abstract

Transcriptome-wide association studies (TWASs) aim to detect associations between genetically predicted gene expression and complex diseases or traits through integrating genome-wide association studies (GWASs) and expression quantitative trait loci (eQTL) mapping studies. Most current TWAS methods analyze one gene at a time, ignoring the correlations between multiple genes. Few of the existing TWAS methods focus on survival outcomes. Here, we propose a novel method, namely a COx proportional hazards model for NEtwork regression in TWAS (CoNet), that is applicable for identifying the association between one given network and the survival time. CoNet treats the general relationships among the predicted gene expression levels as edges of the network and quantifies them through pointwise mutual information (PMI) within a two-stage TWAS framework. Extensive simulation studies illustrate that CoNet not only achieves well-calibrated type I error control in testing both the node effect and the edge effect, but also gains more power compared with currently available methods. In addition, it demonstrates superior performance in a real data application using the breast cancer survival data of UK Biobank. CoNet effectively accounts for network structure and can simultaneously identify the nodes and edges that are potentially related to survival outcomes in TWAS. Full article

(This article belongs to the Special Issue Genetics of Complex Human Disease)

25 pages, 5239 KB

Open Access | Article

Novel Asymmetric Pyramid Aggregation Network for Infrared Dim and Small Target Detection

by Guangrui Lv, Lili Dong, Junke Liang and Wenhai Xu

Remote Sens. 2022, 14(22), 5643; https://doi.org/10.3390/rs14225643 - 8 Nov 2022

Cited by 7 | Viewed by 2570

Abstract

Robust and efficient detection of small infrared targets is a critical and challenging task in infrared search and tracking applications. Small infrared targets are tiny compared to ordinary targets, and the sizes and appearances of these targets differ considerably across scenarios. In addition, these targets are easily submerged in various background noise. To tackle the aforementioned challenges, a novel asymmetric pyramid aggregation network (APANet) is proposed. Specifically, a pyramid structure integrating dual attention and dense connection is first constructed, which not only generates attention-refined multi-scale features in different layers but also preserves the primitive features of infrared small targets among the multi-scale features. Then, the adjacent cross-scale features in this multi-scale information are sequentially modulated through pair-wise asymmetric combination. This mutual dynamic modulation continuously exchanges heterogeneous cross-scale information along the layer-wise aggregation path until an inverted pyramid is generated. In this way, the semantic features of the lower-level network are enriched by incorporating local focus from the higher-level network, while the detail features of the higher-level network are refined by embedding point-wise focus from the lower-level network, which highlights small target features and suppresses background interference. Subsequently, recursive asymmetric fusion is designed to further dynamically modulate and aggregate high-resolution features of different layers in the inverted pyramid, which also enhances the local high response of small targets. Finally, a series of comparative experiments is conducted on two public datasets, and the experimental results show that APANet detects small targets more accurately than some state-of-the-art methods. Full article

(This article belongs to the Special Issue Deep Learning Based Target Detection and Recognition in Remote Sensing Images)

21 pages, 1407 KB

Open Access | Article

Parallel Corpus Research and Target Language Representativeness: The Contrastive, Typological, and Translation Mining Traditions

by Bert Le Bruyn, Martín Fuchs, Martijn van der Klis, Jianan Liu, Chou Mo, Jos Tellings and Henriëtte de Swart

Languages 2022, 7(3), 176; https://doi.org/10.3390/languages7030176 - 7 Jul 2022

Cited by 11 | Viewed by 7217

Abstract

This paper surveys the strategies that the Contrastive, Typological, and Translation Mining parallel corpus traditions rely on to deal with the issue of target language representativeness of translations. On the basis of a comparison of the corpus architectures and research designs of the three traditions, we argue that they have each developed their own representativeness strategies: (i) monolingual control corpora (Contrastive tradition), (ii) limits on the scope of research questions (Typological tradition), and (iii) parallel control corpora (Translation Mining tradition). We introduce normalized pointwise mutual information (NPMI) as a bi-directional measure of cross-linguistic association, allowing for an easy comparison of the outcomes of different traditions and the impact of the monolingual and parallel control corpus representativeness strategies. We further argue that corpus size has a major impact on the reliability of the monolingual control corpus strategy and that a sequential parallel control corpus strategy is preferable for smaller corpora. Full article

(This article belongs to the Special Issue Tense and Aspect Across Languages)

15 pages, 1107 KB

Open Access | Article

A Method of Domain Dictionary Construction for Electric Vehicles Disassembly

by Wei Ren, Hengwei Zhang and Ming Chen

Entropy 2022, 24(3), 363; https://doi.org/10.3390/e24030363 - 3 Mar 2022

Cited by 7 | Viewed by 2989

Abstract

Currently, there is no domain dictionary for electric vehicle disassembly, and existing domain dictionary construction algorithms do not accurately extract terminology from disassembly text because the terminology is complex and variable. Constructing a domain dictionary for the disassembly of electric vehicles is therefore of significant research value. Extracting high-quality keywords from text and categorizing them is a widely used information mining task and is the basis of named entity recognition, relation extraction, knowledge question answering, and other information recognition and extraction tasks in the disassembly domain. In this paper, we propose a supervised learning dictionary construction algorithm based on multi-dimensional features that combines different features of candidate keywords extracted from the text of each scientific study. Keyword recognition is treated as a binary classification problem in which the LightGBM model filters each keyword, and the domain dictionary is then expanded based on the pointwise mutual information between keywords and their categories. We use Chinese disassembly manuals, patents, and papers to establish a general corpus of disassembly information and then use our model to mine keywords for disassembly parts, disassembly tools, disassembly methods, disassembly processes, and other categories. The experiments show that our algorithms significantly improve extraction and categorization performance over traditional algorithms in the disassembly domain. We also investigated the performance of the algorithms and attempt to explain it. Our work sets a benchmark for domain dictionary construction in the field of electric vehicle disassembly, based on the newly developed dataset and using multi-class terminology classification. Full article

Search Results (28)
