MDPI - Publisher of Open Access Journals

20 pages, 602 KB

Open AccessArticle

A Threshold Selection Method in Code Plagiarism Checking Function for Code Writing Problem in Java Programming Learning Assistant System Considering AI-Generated Codes

by Perwira Annissa Dyah Permatasari, Mustika Mentari, Safira Adine Kinari, Soe Thandar Aung, Nobuo Funabiki, Htoo Htoo Sandi Kyaw and Khaing Hsu Wai

Analytics 2026, 5(1), 2; https://doi.org/10.3390/analytics5010002 - 26 Dec 2025

Viewed by 903

Abstract

To support novice learners, the Java programming learning assistant system (JPLAS) has been developed with various features. Among them, the code writing problem (CWP) asks learners to write an answer code that passes a given test code. The correctness of an answer code is validated by running it on JUnit. In previous works, we implemented a code plagiarism checking function that calculates the similarity score for each pair of answer codes based on the Levenshtein distance. When the score is higher than a given threshold, the pair is regarded as plagiarism. However, a method for finding the proper threshold has not been studied. In addition, as generative AI has grown in popularity, AI-generated codes have become a plagiarism threat that should be investigated. In this paper, we propose a threshold selection method based on Tukey’s IQR fences. It uses a custom upper threshold derived from the statistical distribution of similarity scores for each assignment. To better accommodate skewed similarity distributions, the method introduces a simple percentile-based adjustment for determining the upper threshold. We also design prompts to generate answer codes using generative AI and apply them to four AI models. For evaluation, we used a total of 745 source codes from two datasets. The first dataset consists of 420 answer codes across 12 CWP instances from 35 first-year undergraduate students at the State Polytechnic of Malang, Indonesia (POLINEMA). The second dataset includes 325 answer codes across five CWP assignments from 65 third-year undergraduate students at Okayama University, Japan. Applying our proposals showed the following: (1) any pair of student codes whose score is higher than the selected threshold has some evidence of plagiarism, (2) some student codes have a higher similarity than the threshold with AI-generated codes, indicating the use of generative AI, and (3) multiple AI models can generate code that resembles student-written code, despite adopting different implementations. The validity of our proposal is confirmed. Full article

(This article belongs to the Special Issue Critical Challenges in Large Language Models and Data Analytics: Trustworthiness, Scalability, and Societal Impact)
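
The threshold selection described in this abstract can be illustrated with a minimal sketch. It assumes the standard Tukey upper fence, Q3 + 1.5 × IQR, computed over the similarity scores of one assignment. The abstract does not spell out the paper's percentile-based adjustment for skewed distributions, so it is not reproduced here; the function names and the input mapping are hypothetical.

```python
import statistics


def tukey_upper_fence(scores, k=1.5):
    """Return Q3 + k * IQR for a list of similarity scores.

    Assumption: the plain Tukey fence; the paper's percentile-based
    adjustment is not described in the abstract and is omitted.
    """
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def flag_pairs(pair_scores, k=1.5):
    """Flag pairs whose similarity exceeds the per-assignment fence.

    pair_scores maps a (student_a, student_b) tuple to a similarity score.
    Returns the threshold and a dict holding only the flagged pairs.
    """
    threshold = tukey_upper_fence(list(pair_scores.values()), k)
    flagged = {pair: s for pair, s in pair_scores.items() if s > threshold}
    return threshold, flagged
```

Because the fence is recomputed from each assignment's own score distribution, short assignments with uniformly high similarity get a higher threshold than long ones.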

27 pages, 4945 KB

Open AccessArticle

A Robust Framework for Coffee Bean Package Label Recognition: Integrating Image Enhancement with Vision–Language OCR Models

by Thi-Thu-Huong Le, Yeonjeong Hwang, Ahmada Yusril Kadiptya, JunYoung Son and Howon Kim

Sensors 2025, 25(20), 6484; https://doi.org/10.3390/s25206484 - 20 Oct 2025

Cited by 1 | Viewed by 2068

Abstract

Text recognition on coffee bean package labels is of great importance for product tracking and brand verification, but it poses a challenge due to variations in image quality, packaging materials, and environmental conditions. In this paper, we propose a pipeline that combines several image enhancement techniques with an Optical Character Recognition (OCR) model based on Qwen-VL vision–language (VL) variants, conditioned by structured prompts. To facilitate the evaluation, we construct a coffee bean package image set containing two subsets, namely low-resolution (LRCB) and high-resolution (HRCB) coffee bean image sets, covering multiple real-world challenges. These cases involve various packaging types (bottles and bags), label sides (front and back), rotation, and different illumination. To address the image quality problem, we design a dedicated preprocessing pipeline for package label situations. We develop and evaluate four Qwen-VL OCR variants with prompt engineering, which are compared against four baselines: DocTR, PaddleOCR, EasyOCR, and Tesseract. Extensive comparison using various metrics, including the Levenshtein distance, Cosine similarity, Jaccard index, Exact Match, BLEU score, and ROUGE scores (ROUGE-1, ROUGE-2, and ROUGE-L), shows significant improvements over the baselines. In addition, a validation test on the public POIE dataset shows how well the framework generalizes, demonstrating its practicality and reliability for label recognition. Full article

(This article belongs to the Special Issue Digital Imaging Processing, Sensing, and Object Recognition)

13 pages, 2788 KB

Open AccessArticle

Visual Strategies for Guiding Gaze Sequences and Attention in Yi Symbols: Eye-Tracking Insights

by Bo Yuan and Sakol Teeravarunyou

J. Eye Mov. Res. 2025, 18(5), 57; https://doi.org/10.3390/jemr18050057 - 16 Oct 2025

Viewed by 1297

Abstract

This study investigated the effectiveness of visual strategies in guiding gaze behavior and attention on Yi graphic symbols using eye-tracking. Four strategies, color brightness, layering, line guidance, and size variation, were tested with 34 Thai participants unfamiliar with Yi symbol meanings. Gaze sequence analysis, using Levenshtein distance and similarity ratio, showed that bright colors, layered arrangements, and connected lines enhanced alignment with intended gaze sequences, while size variation had minimal effect. Bright red symbols and lines captured faster initial fixations (Time to First Fixation, TTFF) on key Areas of Interest (AOIs), unlike layering and size. Lines reduced dwell time at sequence starts, promoting efficient progression, while larger symbols sustained longer attention, though inconsistently. Color and layering showed no consistent dwell time effects. These findings inform Yi graphic symbol design for effective cross-cultural visual communication. Full article

12 pages, 1332 KB

Open AccessProceeding Paper

U-Tapis: A Hybrid Approach to Melting Word Error Detection and Correction with Damerau-Levenshtein Distance and RoBERTa

by Prudence Tendy and Marlinda Vasty Overbeek

Eng. Proc. 2025, 107(1), 19; https://doi.org/10.3390/engproc2025107019 - 25 Aug 2025

Viewed by 662

Abstract

In the current digital era, the demand for rapid news delivery increases the risk of linguistic errors, including inaccuracies in the usage of melting words. This research introduces the U-Tapis application, a platform designed to detect and correct such errors using the Damerau-Levenshtein Distance algorithm and the RoBERTa model. The system achieved an average recommendation accuracy of 92.84%, with performance ranging from 91.30% to 95.45% across 3000 news articles. Despite its effectiveness, the system faces limitations, such as the static nature of its dataset, which does not update dynamically with new entries in the Indonesian Language Dictionary, and its tendency to flag all words with “me-” and “pe-” prefixes, regardless of context. These challenges highlight opportunities for future enhancements to improve the platform’s adaptability and precision. Full article

(This article belongs to the Proceedings of The 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society)
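
As an illustration of the distance used in this abstract, the sketch below implements the optimal string alignment (OSA) variant of the Damerau-Levenshtein distance, which counts an adjacent transposition as one edit. Which variant U-Tapis uses is not stated in the abstract, so OSA is an assumption, as are the `suggest` helper and the sample vocabulary. The RoBERTa stage is not shown.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ay" <-> "ya".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary entries within max_distance, closest first.

    Hypothetical helper: the vocabulary stands in for a dictionary word list.
    """
    scored = sorted((osa_distance(word, v), v) for v in vocabulary)
    return [v for dist, v in scored if dist <= max_distance]
```

A swapped pair of letters, as in "menaypu" for "menyapu", costs one edit here, whereas the plain Levenshtein distance would charge two.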

20 pages, 367 KB

Open AccessArticle

Spheres of Strings Under the Levenshtein Distance

by Said Algarni and Othman Echi

Axioms 2025, 14(8), 550; https://doi.org/10.3390/axioms14080550 - 22 Jul 2025

Viewed by 1125

Abstract

Let $\Sigma$ be a nonempty set of characters, called an alphabet. The run-length encoding (RLE) algorithm processes any nonempty string $u$ over $\Sigma$ and produces two outputs: a $k$-tuple $(b_{1}, b_{2}, \dots, b_{k})$, where each $b_{i}$ is a character and $b_{i+1} \neq b_{i}$; and a corresponding $k$-tuple $(q_{1}, q_{2}, \dots, q_{k})$ of positive integers, so that the original string can be reconstructed as $u = b_{1}^{q_{1}} b_{2}^{q_{2}} \dots b_{k}^{q_{k}}$. The integer $k$ is termed the run-length of $u$, and symbolized by $\rho(u)$. By convention, we let $\rho(\varepsilon) = 0$. In the Euclidean space $(\mathbb{R}^{n}, \| \cdot \|_{2})$, the volume of a sphere is determined solely by the dimension $n$ and the radius, following well-established formulas. However, for spheres of strings under the edit metric, the situation is more complex, and no general formulas have been identified. This work intended to show that the volume of the sphere $S_{L}(u, 1)$, composed of all strings of Levenshtein distance 1 from $u$, is dependent on the specific structure of the “RLE-decomposition” of $u$. Notably, this volume equals $(2 l(u) + 1) s - (2 l(u) - \rho(u))$, where $\rho(u)$ represents the run-length of $u$ and $l(u)$ denotes its length (i.e., the number of characters in $u$). Given an integer $p \geq 2$, we present a partial result concerning the computation of the volume $|S_{L}(u, p)|$ in the specific case where the run-length $\rho(u) = 1$. More precisely, for a fixed integer $n \geq 1$ and a character $a \in \Sigma$, we explicitly compute the volume of the Levenshtein sphere of radius $p$, centered at the string $u = a^{n}$. This case corresponds to the simplest run structure and serves as a foundational step toward understanding the general behavior of Levenshtein spheres. Full article
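
The radius-1 volume formula in this abstract can be checked by brute force on small cases. The sketch below assumes that $s$ denotes the alphabet size $|\Sigma|$ (the abstract does not define $s$) and that $u$ is a nonempty string over $\Sigma$. It enumerates every string reachable by one insertion, deletion, or substitution and compares the count with $(2 l(u) + 1) s - (2 l(u) - \rho(u))$. The function names are illustrative.

```python
def sphere_radius_one(u: str, alphabet: str) -> set:
    """All strings at Levenshtein distance exactly 1 from u."""
    out = set()
    for i in range(len(u) + 1):
        for c in alphabet:
            out.add(u[:i] + c + u[i:])  # insertion: length l(u) + 1
    for i in range(len(u)):
        out.add(u[:i] + u[i + 1:])      # deletion: length l(u) - 1
        for c in alphabet:
            if c != u[i]:
                out.add(u[:i] + c + u[i + 1:])  # substitution: differs from u
    return out


def run_length(u: str) -> int:
    """Number of maximal runs of equal characters, rho(u)."""
    return sum(1 for i, c in enumerate(u) if i == 0 or c != u[i - 1])


def volume_formula(u: str, s: int) -> int:
    """(2 l(u) + 1) s - (2 l(u) - rho(u)), with s assumed to be |Sigma|."""
    n = len(u)
    return (2 * n + 1) * s - (2 * n - run_length(u))


if __name__ == "__main__":
    for word in ("a", "aab", "abab"):
        print(word, len(sphere_radius_one(word, "ab")), volume_formula(word, 2))
```

The three edit types produce strings of different lengths, so their result sets are disjoint. Insertions contribute $l(u)(s-1) + s$ distinct strings, deletions contribute $\rho(u)$, and substitutions contribute $l(u)(s-1)$, which sums to the stated expression.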

37 pages, 5216 KB

Open AccessArticle

Unraveling the Overall Picture of Japanese Dialect Variation: What Factors Shape the Big Picture?

by Wilbert Heeringa and Fumio Inoue

Languages 2025, 10(6), 141; https://doi.org/10.3390/languages10060141 - 12 Jun 2025

Viewed by 3294

Abstract

We studied Japanese dialect variation by calculating aggregated PMI Levenshtein distances among local Japanese dialects using data from 2400 locations and 141 items from the Linguistic Atlas of Japan Database (LAJDB). Through factor analysis, we found the latent linguistic variables underlying the aggregated distances. We found two factors, the first of which reflects a division into five groups, and the second of which reflects the long-standing East/West cultural contrast in mainland Japan, also known as the AB division. In the latter division, the eastern group includes the Okinawa islands. We paid special attention to the Tokyo dialect, which is associated with Standard Japanese. In a second factor analysis, only distances to the Tokyo dialect were considered. Although the patterns represented by the four factors vary, they consistently show that dialects geographically closer to Tokyo are more similar to the Tokyo dialect. Additionally, the first three factors reflected the similarity of the Hokkaido varieties to Tokyo’s local dialect. The results of the factor analyses were linked back to the individual variation patterns of the 141 items. A more precise analysis of Tokyo’s position within the Japanese dialect continuum revealed that it is situated within a region of local dialects characterized by relatively small average linguistic distances to other dialects. This area includes the more central area of mainland Japan and Hokkaido. When the influence of geographical distance is filtered out, only the local dialects of Hokkaido remain as dialects with the smallest average distance to other local dialects. Additionally, we observed that dialects geographically close to Tokyo are most closely related to it. However, when we again use distances that are controlled for geographical distance, the local dialects on Hokkaido stand out as being closely related to the Tokyo dialect. This probably indicates that the Tokyo dialect has had a relatively large influence on Hokkaido. Full article

(This article belongs to the Special Issue Dialectal Dynamics)

29 pages, 2368 KB

Open AccessArticle

Chinese “Dialects” and European “Languages”: A Comparison of Lexico-Phonetic and Syntactic Distances

by Chaoju Tang, Vincent J. van Heuven, Wilbert Heeringa and Charlotte Gooskens

Languages 2025, 10(6), 127; https://doi.org/10.3390/languages10060127 - 29 May 2025

Cited by 2 | Viewed by 7834

Abstract

In this article, we tested some specific claims made in the literature on relative distances among European languages and among Chinese dialects, suggesting that some language varieties within the Sinitic family traditionally called dialects are, in fact, more linguistically distant from one another than some European varieties that are traditionally called languages. More generally, we examined whether distances among varieties within and across European language families were larger than those within and across Sinitic language varieties. To this end, we computed lexico-phonetic as well as syntactic distance measures for comparable language materials in six Germanic, five Romance and six Slavic languages, as well as for six Mandarin and nine non-Mandarin (‘southern’) Chinese varieties. Lexico-phonetic distances were expressed as the length-normalized MPI-weighted Levenshtein distances computed on the 100 most frequently used nouns in the 32 language varieties. Syntactic distance was implemented as the (complement of) the Pearson correlation coefficient found for the PoS trigram frequencies established for a parallel corpus of the same four texts translated into each of the 32 languages. The lexico-phonetic distances proved to be relatively large and of approximately equal magnitude in the Germanic, Slavic and non-Mandarin Chinese language varieties. However, the lexico-phonetic distances among the Romance and Mandarin languages were considerably smaller, but of similar magnitude. Cantonese (Guangzhou dialect) was lexico-phonetically as distant from Standard Mandarin (Beijing dialect) as European language pairs such as Portuguese–Italian, Portuguese–Romanian and Dutch–German. 
Syntactically, however, the differences among the Sinitic varieties were about ten times smaller than the differences among the European languages, both within and across the families—which provides some justification for the Chinese tradition of calling the Sinitic varieties dialects of the same language. Full article

(This article belongs to the Special Issue Dialectal Dynamics)
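
The syntactic distance described in this abstract can be sketched as follows. The sketch reads "complement of the Pearson correlation coefficient" as 1 − r, which is an assumption, and computes r over the PoS trigram counts of two tag sequences. The tag sequences and function names are hypothetical; the paper's corpus preparation is not shown.

```python
from collections import Counter
from math import sqrt


def trigram_counts(tags):
    """Count consecutive PoS tag triples in one tagged text."""
    return Counter(zip(tags, tags[1:], tags[2:]))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def syntactic_distance(tags_a, tags_b):
    """1 - r over trigram counts, aligned on the union of observed trigrams."""
    counts_a, counts_b = trigram_counts(tags_a), trigram_counts(tags_b)
    keys = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a[k] for k in keys]  # a missing trigram counts as 0
    vec_b = [counts_b[k] for k in keys]
    return 1.0 - pearson(vec_a, vec_b)
```

Under this reading the distance ranges from 0 for perfectly correlated trigram profiles to 2 for perfectly anti-correlated ones.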

26 pages, 22879 KB

Open AccessArticle

Exploring Tonal Variation Using Dialect Tonometry

by Ho Wang Matthew Sung and Jelena Prokić

Languages 2024, 9(12), 378; https://doi.org/10.3390/languages9120378 - 18 Dec 2024

Viewed by 2917

Abstract

Most research on dialectometry has so far focused primarily on European languages. Within these studies, analyses on the phonetic level predominantly focus on segments. A lack of studies on languages outside of Europe means that the variation in many lesser-studied languages, including tonal languages, is largely unknown. Tonal languages are languages in which pitch is used as an indication in the lexical realisations of (at least some) morphemes, and over half of the world’s languages include lexical tones. Despite tones being an inseparable and non-negligible part of the majority of the world’s languages, there are only a handful of quantitative dialectometric studies on the dialectal variation in tonal languages. In this paper, we explore the phonetic and phonological variations in Yue, a lesser-studied tonal language spoken by around 80 million people in Southern China. Using a newly devised tone representation (modified Onset–Contour–Offset) combined with the Levenshtein distance, we explore the patterns of dialectal variation on the tonal level, as well as to what extent tonal variation correlates with segmental variation. Our results show that tones behave rather differently from segments, and thus, we illustrate that studying lesser-studied and tonal languages can contribute immensely to the study of dialect variation in general. Full article

(This article belongs to the Special Issue Dialectal Dynamics)

29 pages, 13855 KB

Open AccessArticle

Cart-State-Aware Discovery of E-Commerce Visitor Journeys with Process Mining

by Bilal Topaloglu, Basar Oztaysi and Onur Dogan

J. Theor. Appl. Electron. Commer. Res. 2024, 19(4), 2851-2879; https://doi.org/10.3390/jtaer19040138 - 17 Oct 2024

Cited by 7 | Viewed by 3849

Abstract

Understanding customer journeys is key to e-commerce success. Many studies have been conducted to obtain journey maps of e-commerce visitors. To our knowledge, a complete, end-to-end and structured map of e-commerce journeys is still missing. In this research, we proposed a four-step methodology to extract and understand e-commerce visitor journeys using process mining. In order to obtain more structured process diagrams, we used techniques such as activity type enrichment, start and end node identification, and Levenshtein distance-based clustering in this methodology. For the evaluation of the resulting diagrams, we developed a model utilizing expert knowledge. As a result of this empirical study, we identified the most significant factors for process structuredness and their relationships. Using a real-life big dataset which has over 20 million rows, we defined activity-, behavior-, and process-level e-commerce visitor journeys. Exploitation and exploration were the most common journeys, and it was revealed that journeys with exploration behavior had significantly lower conversion rates. At the process level, we mapped the backbones of eight journeys and tested their qualities with the empirical structuredness measure. By using cart statuses at the beginning and end of these journeys, we obtained a high-level end-to-end e-commerce journey that can be used to improve recommendation performance. Additionally, we proposed new metrics to evaluate online user journeys and to benchmark e-commerce journey design success. Full article

(This article belongs to the Topic Online User Behavior in the Context of Big Data)

18 pages, 7344 KB

Open AccessArticle

A User Location Reset Method through Object Recognition in Indoor Navigation System Using Unity and a Smartphone (INSUS)

by Evianita Dewi Fajrianti, Yohanes Yohanie Fridelin Panduman, Nobuo Funabiki, Amma Liesvarastranta Haz, Komang Candra Brata and Sritrusta Sukaridhoto

Network 2024, 4(3), 295-312; https://doi.org/10.3390/network4030014 - 22 Jul 2024

Cited by 7 | Viewed by 3037

Abstract

To enhance user experiences of reaching destinations in large, complex buildings, we have developed an indoor navigation system using Unity and a smartphone called INSUS. It can reset the user location using a quick response (QR) code to reduce the loss of direction of the user during navigation. However, this approach needs a number of QR code sheets to be prepared in the field, causing extra loads at implementation. In this paper, we propose another reset method to reduce loads by recognizing information of naturally installed signs in the field using object detection and Optical Character Recognition (OCR) technologies. A lot of signs exist in a building, containing texts such as room numbers, room names, and floor numbers. In the proposal, a sign image is taken with a smartphone, the sign is detected by YOLOv8, the text inside the sign is recognized by PaddleOCR, and it is compared with each record in the Room Database using Levenshtein distance. For evaluations, we applied the proposal in two buildings in Okayama University, Japan. The results show that YOLOv8 achieved an mAP@0.5 of 0.995 and an mAP@0.5:0.95 of 0.978, and PaddleOCR could extract text in the sign image accurately with an average character error rate (CER) lower than 10%. The combination of YOLOv8 and PaddleOCR decreases the execution time by 6.71 s compared to the previous method. The results confirmed the effectiveness of the proposal. Full article

48 pages, 5859 KB

Open AccessArticle

Network Models of BACE-1 Inhibitors: Exploring Structural and Biochemical Relationships

by Ömer Akgüller, Mehmet Ali Balcı and Gabriela Cioca

Int. J. Mol. Sci. 2024, 25(13), 6890; https://doi.org/10.3390/ijms25136890 - 23 Jun 2024

Viewed by 1986

Abstract

This study investigates the clustering patterns of human β-secretase 1 (BACE-1) inhibitors using complex network methodologies based on various distance functions, including Euclidean, Tanimoto, Hamming, and Levenshtein distances. Molecular descriptor vectors such as molecular mass, Merck Molecular Force Field (MMFF) energy, Crippen partition coefficient (ClogP), Crippen molar refractivity (MR), eccentricity, Kappa indices, Synthetic Accessibility Score, Topological Polar Surface Area (TPSA), and 2D/3D autocorrelation entropies are employed to capture the diverse properties of these inhibitors. The Euclidean distance network demonstrates the most reliable clustering results, with strong agreement metrics and minimal information loss, indicating its robustness in capturing essential structural and physicochemical properties. Tanimoto and Hamming distance networks yield valuable clustering outcomes, albeit with moderate performance, while the Levenshtein distance network shows significant discrepancies. The analysis of eigenvector centrality across different networks identifies key inhibitors acting as hubs, which are likely critical in biochemical pathways. Community detection results highlight distinct clustering patterns, with well-defined communities providing insights into the functional and structural groupings of BACE-1 inhibitors. The study also conducts non-parametric tests, revealing significant differences in molecular descriptors, validating the clustering methodology. Despite its limitations, including reliance on specific descriptors and computational complexity, this study offers a comprehensive framework for understanding molecular interactions and guiding therapeutic interventions. Future research could integrate additional descriptors, advanced machine learning techniques, and dynamic network analysis to enhance clustering accuracy and applicability. Full article

(This article belongs to the Special Issue Complex Networks, Bio-Molecular Systems, and Machine Learning 2.0)

27 pages, 9834 KB

Open AccessArticle

Detection and Recognition of Voice Commands by a Distributed Acoustic Sensor Based on Phase-Sensitive OTDR in the Smart Home Concept

by Tatyana V. Gritsenko, Maria V. Orlova, Andrey A. Zhirnov, Yuri A. Konstantinov, Artem T. Turov, Fedor L. Barkov, Roman I. Khan, Kirill I. Koshelev, Cesare Svelto and Alexey B. Pnev

Sensors 2024, 24(7), 2281; https://doi.org/10.3390/s24072281 - 3 Apr 2024

Cited by 11 | Viewed by 3699

Abstract

In recent years, attention to the realization of a distributed fiber-optic microphone for the detection and recognition of the human voice has increased, whereby the most popular schemes are based on φ-OTDR. Many issues related to the selection of optimal system parameters and the recognition of registered signals, however, are still unresolved. In this research, we conducted theoretical studies of these issues based on the φ-OTDR mathematical model and verified them with experiments. We designed an algorithm for fiber sensor signal processing, applied a testing kit, and designed a method for the quantitative evaluation of our obtained results. We also proposed a new setup model for lab tests of φ-OTDR single coordinate sensors, which allows for the quick variation of their parameters. As a result, it was possible to define requirements for the best quality of speech recognition; estimation using the percentage of recognized words yielded a value of 96.3%, and estimation with Levenshtein distance provided a value of 15. Full article

(This article belongs to the Special Issue Editorial Board Members' Collection Series: Optical Measurements and Sensing Technology)

22 pages, 1343 KB

Open AccessArticle

Code Comments: A Way of Identifying Similarities in the Source Code

by Rares Folea and Emil Slusanschi

Mathematics 2024, 12(7), 1073; https://doi.org/10.3390/math12071073 - 2 Apr 2024

Cited by 2 | Viewed by 2361

Abstract

This study investigates whether analyzing the code comments available in the source code can effectively reveal functional similarities within software. The authors explore how both machine-readable comments (such as linter instructions) and human-readable comments (in natural language) can contribute towards measuring code similarity. For the former, the work relies on computing the cosine similarity over the one-hot encoded representation of the machine-readable comments, while for the latter, the focus is on detecting similarities in English comments, using threshold-based computations against the similarity measurements obtained using models based on Levenshtein distances (for form-based matches), Word2Vec (for contextual word representations), as well as deep learning models, such as Sentence Transformers or Universal Sentence Encoder (for semantic similarity). For evaluation, this research has analyzed the similarities between different source code versions of the open-source code editor, VSCode, based on existing ESLint-specific directives, as well as applying natural language processing techniques on incremental releases of Kubernetes, an open-source system for automating containerized application management. The experiments outline the potential for detecting code similarities solely based on comments, and observations indicate that models like Universal Sentence Encoder provide a favorable balance between recall and precision. This research is integrated into Project Martial, an open-source project for automatic assistance in detecting plagiarism in software. Full article

(This article belongs to the Special Issue Artificial Intelligence and Machine Learning Based Methods and Applications)
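
For the machine-readable comments mentioned in this abstract, the cosine similarity over a one-hot encoding can be sketched as below. The directive strings in the usage note and the choice of vocabulary (the union of directives seen in both files) are illustrative assumptions, not the paper's exact encoding.

```python
from math import sqrt


def one_hot(directives, vocabulary):
    """Encode which vocabulary entries occur in a file's directive list."""
    present = set(directives)
    return [1 if entry in present else 0 for entry in vocabulary]


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def directive_similarity(directives_a, directives_b):
    """Compare two files by the linter directives they contain."""
    vocabulary = sorted(set(directives_a) | set(directives_b))
    return cosine_similarity(
        one_hot(directives_a, vocabulary), one_hot(directives_b, vocabulary)
    )
```

For example, a file containing `eslint-disable no-console` and `eslint-disable-next-line eqeqeq` compared with a file containing only the first directive shares one of two dimensions, so the score falls strictly between 0 and 1.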

17 pages, 1756 KB

Open AccessArticle

Code Plagiarism Checking Function and Its Application for Code Writing Problem in Java Programming Learning Assistant System

by Ei Ei Htet, Khaing Hsu Wai, Soe Thandar Aung, Nobuo Funabiki, Xiqin Lu, Htoo Htoo Sandi Kyaw and Wen-Chung Kao

Analytics 2024, 3(1), 46-62; https://doi.org/10.3390/analytics3010004 - 17 Jan 2024

Cited by 8 | Viewed by 4107

Abstract

A web-based Java programming learning assistant system (JPLAS) has been developed for novice students to study Java programming by themselves while enhancing code reading and code writing skills. One of the implemented exercise problem types is the code writing problem (CWP), which asks students to create a source code that can pass the given test code. The correctness of this answer code is validated by running it on JUnit. In previous works, a Python-based answer code validation program was implemented to assist teachers. It automatically verifies the source codes from all the students for one test code, and reports the number of passed test cases by each code in the CSV file. While this program plays a crucial role in checking the correctness of code behaviors, it cannot detect code plagiarism that can often happen in programming courses. In this paper, we implement a code plagiarism checking function in the answer code validation program, and present its application results to a Java programming course at Okayama University, Japan. This function first removes the whitespace characters and the comments using regular expressions. Next, it calculates the Levenshtein distance and similarity score for each pair of source codes from different students in the class. If the score is larger than a given threshold, they are regarded as plagiarism. Finally, it outputs the scores as a CSV file with the student IDs. For evaluations, we applied the proposed function to a total of 877 source codes for 45 CWP assignments submitted by 9 to 39 students and analyzed the results. It was found that (1) CWP assignments asking for shorter source codes generate higher scores than those for longer codes due to the use of test codes, (2) proper thresholds differ by assignment, and (3) some students often copied source codes from certain students. Full article

(This article belongs to the Special Issue New Insights in Learning Analytics)
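
The steps listed in this abstract (strip comments and whitespace, compute the pairwise Levenshtein distance, score, threshold, write CSV) can be sketched in a few lines. The regular expressions, the similarity formula 1 − d / max(len), the column names, and the function names are all assumptions for illustration; the abstract does not give the authors' exact choices.

```python
import csv
import io
import re
from itertools import combinations

# Naive Java comment pattern: it also removes comment-like text inside
# string literals, which a real checker would need to handle.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def normalize(code: str) -> str:
    """Drop comments, then drop all whitespace."""
    return re.sub(r"\s+", "", COMMENT_RE.sub("", code))


def levenshtein(a: str, b: str) -> int:
    """Edit distance using two rolling rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def report(submissions: dict, threshold: float) -> str:
    """Return CSV text with one row per pair of student IDs."""
    cleaned = {sid: normalize(src) for sid, src in submissions.items()}
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["student_a", "student_b", "similarity", "flagged"])
    for a, b in combinations(sorted(cleaned), 2):
        score = similarity(cleaned[a], cleaned[b])
        writer.writerow([a, b, f"{score:.3f}", score > threshold])
    return buffer.getvalue()
```

Normalizing before comparison means that reformatting or re-commenting a copied answer does not lower its score, which is the point of the preprocessing step the abstract describes.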

17 pages, 1801 KB

Open AccessArticle

Toward Effective Aircraft Call Sign Detection Using Fuzzy String-Matching between ASR and ADS-B Data

by Mohammed Saïd Kasttet, Abdelouahid Lyhyaoui, Douae Zbakh, Adil Aramja and Abderazzek Kachkari

Aerospace 2024, 11(1), 32; https://doi.org/10.3390/aerospace11010032 - 29 Dec 2023

Cited by 6 | Viewed by 3922

Abstract

Recently, artificial intelligence and data science have witnessed dramatic progress and rapid growth, especially Automatic Speech Recognition (ASR) technology based on Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs). Consequently, new end-to-end Recurrent Neural Network (RNN) toolkits were developed with higher speed and accuracy that can often achieve a Word Error Rate (WER) below 10%. These toolkits can nowadays be deployed, for instance, within aircraft cockpits and Air Traffic Control (ATC) systems in order to identify aircraft and display recognized voice messages related to flight data, especially for airports not equipped with radar. Hence, the performance of air traffic controllers and pilots can ultimately be improved by reducing workload and stress and enforcing safety standards. Our experiment conducted at Tangier’s International Airport ATC aimed to build an ASR model that is able to recognize aircraft call signs in a fast and accurate way. The acoustic and linguistic models were trained on the Ibn Battouta Speech Corpus (IBSC), resulting in an unprecedented speech dataset with approved transcription that includes real weather aerodrome observation data and flight information with a call sign captured by an ADS-B receiver. All of these data were synchronized with voice recordings in a structured format. We calculated the WER to evaluate the model’s accuracy and compared different methods of dataset training for model building and adaptation. Despite the high interference in the VHF radio communication channel and fast-speaking conditions that increased the WER level to 20%, our standalone and low-cost ASR system with a trained RNN model, supported by the Deep Speech toolkit, was able to achieve call sign detection rate scores up to 96% in air traffic controller messages and 90% in pilot messages while displaying related flight information from ADS-B data using the Fuzzy string-matching algorithm. Full article

(This article belongs to the Special Issue Automatic Speech Recognition and Understanding in Air Traffic Management)
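
The Word Error Rate (WER) reported in this abstract is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below follows that standard definition; whitespace tokenization and the function name are assumptions, and the paper's own tooling is not shown.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(
                min(
                    prev[j] + 1,                           # deletion
                    cur[j - 1] + 1,                        # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1] / len(ref)
```

One misrecognized word in a six-word call-sign phrase gives a WER of 1/6, roughly 17%. Because insertions count as errors, the value can exceed 1 when the hypothesis is much longer than the reference.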

Search Results (33)
