Data

12 pages, 1202 KB

Open Access Data Descriptor

Toward Responsible AI in High-Stakes Domains: A Dataset for Building Static Analysis with LLMs in Structural Engineering

by Carlos Avila, Daniel Ilbay, Paola Tapia and David Rivera

Data 2025, 10(11), 169; https://doi.org/10.3390/data10110169 (registering DOI) - 24 Oct 2025

Abstract

Modern engineering increasingly operates within socio-technical networks, such as the interdependence of energy grids, transport systems, and building codes, where decisions must be reliable and transparent. Large language models (LLMs) such as GPT promise efficiency by interpreting domain-specific queries and generating outputs, yet their predictive nature can introduce biases or fabricated values—risks that are unacceptable in structural engineering, where safety and compliance are paramount. This work presents a dataset that embeds generative AI into validated computational workflows through the Model Context Protocol (MCP). MCP enables API-based integration between ChatGPT (GPT-4o) and numerical solvers by converting natural-language prompts into structured solver commands. This creates context-aware exchanges—for example, transforming a query on seismic drift limits into an OpenSees analysis—whose results are benchmarked against manually generated ETABS models. This architecture ensures traceability, reproducibility, and alignment with seismic design standards. The dataset contains prompts, GPT outputs, solver-based analyses, and comparative error metrics for four reinforced concrete frame models designed under Ecuadorian (NEC-15) and U.S. (ASCE 7-22) codes. The end-to-end runtime for these scenarios, including LLM prompting, MCP orchestration, and solver execution, ranged between 6 and 12 s, demonstrating feasibility for design and verification workflows. Beyond providing records, the dataset establishes a reproducible methodology for integrating LLMs into engineering practice, with three goals: enabling independent verification, fostering collaboration across AI and civil engineering, and setting benchmarks for responsible AI use in high-stakes domains. Full article

28 pages, 2676 KB

Open Access Article

Multi-Aspect Sentiment Classification of Arabic Tourism Reviews Using BERT and Classical Machine Learning

by Samar Zaid, Amal Hamed Alharbi and Halima Samra

Data 2025, 10(11), 168; https://doi.org/10.3390/data10110168 - 23 Oct 2025

Abstract

Understanding visitor sentiment is essential for developing effective tourism strategies, particularly as Google Maps reviews have become a key channel for public feedback on tourist attractions. Yet, the unstructured format and dialectal diversity of Arabic reviews pose significant challenges for extracting actionable insights at scale. This study evaluates the performance of traditional machine learning and transformer-based models for aspect-based sentiment analysis (ABSA) on Arabic Google Maps reviews of tourist sites across Saudi Arabia. A manually annotated dataset of more than 3500 reviews was constructed to assess model effectiveness across six tourism-related aspects: price, cleanliness, facilities, service, environment, and overall experience. Experimental results demonstrate that multi-head BERT architectures, particularly AraBERT, consistently outperform traditional classifiers in identifying aspect-level sentiment. AraBERT achieved an F1-score of 0.97 for the cleanliness aspect, compared with 0.91 for the best-performing classical model (LinearSVC), indicating a substantial improvement. The proposed ABSA framework facilitates automated, fine-grained analysis of visitor perceptions, enabling data-driven decision-making for tourism authorities and contributing to the strategic objectives of Saudi Vision 2030. Full article

22 pages, 2674 KB

Open Access Review

Beyond the List: A Framework for the Design of Next-Generation MEDLINE Search Tools

by Vladimir Zhurov, Kamran Sedig and Mostafa Milani

Data 2025, 10(10), 167; https://doi.org/10.3390/data10100167 - 21 Oct 2025

Abstract

Despite the critical importance of biomedical databases like MEDLINE, users are often hampered by search tools with stagnant designs that fail to support complex exploratory tasks. To address this limitation, we synthesized research from visual analytics and related fields to propose a new design framework for non-traditional search interfaces. This framework was built upon seven core principles: visualization, interaction, machine learning, ontology, triaging, progressive disclosure, and evolutionary design. For each principle, we detail its rationale and demonstrate how its integration can transcend the limitations of conventional search tools. We contend that by leveraging this framework, designers can create more powerful and effective search tools that empower users to navigate complex information landscapes. Full article

(This article belongs to the Special Issue Interactive Visual Analytics: Bridging Human Cognition and Complex Data)

22 pages, 2269 KB

Open Access Data Descriptor

MCR-SL: A Multimodal, Context-Rich Skin Lesion Dataset for Skin Cancer Diagnosis

by Maria Castro-Fernandez, Thomas Roger Schopf, Irene Castaño-Gonzalez, Belinda Roque-Quintana, Herbert Kirchesch, Samuel Ortega, Himar Fabelo, Fred Godtliebsen, Conceição Granja and Gustavo M. Callico

Data 2025, 10(10), 166; https://doi.org/10.3390/data10100166 - 18 Oct 2025

Abstract

Well-annotated datasets are fundamental for developing robust artificial intelligence models, particularly in medical fields. Many existing skin lesion datasets have limitations in image diversity (including only clinical or dermoscopic images) or metadata, which hinder their utility for mimicking real-world clinical practice. MCR-SL is a new, meticulously curated dataset that addresses these limitations. The MCR-SL dataset was collected from 60 subjects at the University Hospital of North Norway and comprises 779 clinical images and 1352 dermoscopic images of 240 unique lesions. The lesion types included are nevus, seborrheic keratosis, basal cell carcinoma, actinic keratosis, atypical nevus, melanoma, squamous cell carcinoma, angioma, and dermatofibroma. Labels were established by combining the consensus of a panel of four dermatologists with histopathology reports for the 29 excised lesions, with the latter serving as the gold standard. The resulting dataset provides a comprehensive resource with clinical and dermoscopic images and rich clinical context, ensuring a high level of clinical relevance and surpassing many existing resources in that respect. The MCR-SL dataset provides a holistic and reliable foundation for validating artificial intelligence models, enabling a more nuanced and clinically relevant approach to automated skin lesion diagnosis that mirrors real-world clinical practice. Full article

20 pages, 11103 KB

Open Access Data Descriptor

VitralColor-12: A Synthetic Twelve-Color Segmentation Dataset from GPT-Generated Stained-Glass Images

by Martín Montes Rivera, Carlos Guerrero-Mendez, Daniela Lopez-Betancur, Tonatiuh Saucedo-Anaya, Manuel Sánchez-Cárdenas and Salvador Gómez-Jiménez

Data 2025, 10(10), 165; https://doi.org/10.3390/data10100165 - 18 Oct 2025

Abstract

The segmentation and classification of color are crucial stages in image processing, computer vision, and pattern recognition, as they significantly impact the results. The diverse, hand-labeled datasets in the literature are applied for monochromatic or color segmentation in specific domains. On the other hand, synthetic datasets are generated using statistics, artificial intelligence algorithms, or generative artificial intelligence (AI). The latter includes Large Language Models (LLMs), Generative Adversarial Networks (GANs), and Variational Autoencoders (VAEs), among others. In this work, we propose VitralColor-12, a synthetic dataset for color classification and segmentation, comprising twelve colors: black, blue, brown, cyan, gray, green, orange, pink, purple, red, white, and yellow. VitralColor-12 addresses the limitations of color segmentation and classification datasets by leveraging the capabilities of LLMs, including adaptability, variability, copyright-free content, and lower-cost data, all properties that are desirable in image datasets. VitralColor-12 includes pixel-level classification and segmentation maps. This makes the dataset broadly applicable and highly variable for a range of computer vision applications. VitralColor-12 utilizes GPT-5 and DALL·E 3 for generating stained-glass images. These images simplify the annotation process, since stained-glass images have isolated colors with distinct boundaries within the steel structure, which provide easy regions to label with a single color per region. Once we obtain the images, we use at least one hand-labeled centroid per color to automatically cluster all pixels based on Euclidean distance and morphological operations, including erosion and dilation. This process enables us to automatically label a classification dataset and generate segmentation maps. Our dataset comprises 910 images, organized into 70 generated images and 12 pixel segmentation maps (one for each color), which include 9,509,524 labeled pixels, 1,794,758 of which are unique. These annotated pixels are represented by RGB, HSL, CIELAB, and YCbCr values, enabling a detailed color analysis. Moreover, VitralColor-12 offers features that address gaps in public resources, such as violin diagrams showing the frequency of colors across images, histograms of channels per color, 3D color maps, descriptive statistics, and standardized metrics such as ΔE76, ΔE94, and CIELAB chromaticity. These demonstrate the dataset's distribution, applicability, and realistic perceptual structure, including warm, neutral, and cold colors and the high contrast between black and white, and they yield meaningful perceptual clusters that reinforce its utility for color segmentation and classification. Full article
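
The centroid-based labeling step described in this abstract can be illustrated with a minimal sketch. It covers only the Euclidean nearest-centroid assignment in RGB space and leaves out the erosion and dilation stage. The centroid values, the four-color subset, and the function names are hypothetical and chosen for illustration; they are not taken from the dataset or its authors' code.

```python
import math

# Hypothetical hand-labeled RGB centroids (at least one per color).
# The dataset's real centroids are not listed in the abstract.
CENTROIDS = {
    "black": [(0, 0, 0)],
    "white": [(255, 255, 255)],
    "red": [(220, 20, 30)],
    "blue": [(20, 40, 200)],
}


def nearest_color(pixel, centroids=CENTROIDS):
    """Return the color whose closest centroid is nearest to `pixel` (Euclidean distance)."""
    best_name, best_dist = None, math.inf
    for name, points in centroids.items():
        for point in points:
            dist = math.dist(pixel, point)
            if dist < best_dist:
                best_name, best_dist = name, dist
    return best_name


def label_image(pixels, centroids=CENTROIDS):
    """Label every pixel of a 2D grid (list of rows of RGB tuples)."""
    return [[nearest_color(p, centroids) for p in row] for row in pixels]


# Tiny 2x2 example image.
image = [
    [(10, 10, 10), (250, 250, 250)],
    [(200, 30, 40), (30, 50, 190)],
]
labels = label_image(image)
```

Allowing several centroids per color, as the abstract's "at least one" suggests, lets a color with wide tonal variation be covered without changing the assignment rule.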

58 pages, 744 KB

Open Access Article

Review and Comparative Analysis of Databases for Speech Emotion Recognition

by Salvatore Serrano, Omar Serghini, Giulia Esposito, Silvia Carbone, Carmela Mento, Alessandro Floris, Simone Porcu and Luigi Atzori

Data 2025, 10(10), 164; https://doi.org/10.3390/data10100164 - 14 Oct 2025

Abstract

Speech emotion recognition (SER) has become increasingly important in areas such as healthcare, customer service, robotics, and human–computer interaction. The progress of this field depends not only on advances in algorithms but also on the databases that provide the training material for SER systems. These resources set the boundaries for how well models can generalize across speakers, contexts, and cultures. In this paper, we present a narrative review and comparative analysis of emotional speech corpora released up to mid-2025, bringing together both psychological and technical perspectives. Rather than following a systematic review protocol, our approach focuses on providing a critical synthesis of more than fifty corpora covering acted, elicited, and natural speech. We examine how these databases were collected, how emotions were annotated, their demographic diversity, and their ecological validity, while also acknowledging the limits of available documentation. Beyond description, we identify recurring strengths and weaknesses, highlight emerging gaps, and discuss recent usage patterns to offer researchers both a practical guide for dataset selection and a critical perspective on how corpus design continues to shape the development of robust and generalizable SER systems. Full article

16 pages, 5977 KB

Open Access Data Descriptor

Comparative Data Analysis of Non-Destructive Testing for Hollow Heart in Potatoes

by Mary M. Hofle, Nusrat Farheen, Mathew Zachary Shumway, Evan D. Mosher, Keyave C. Hone and Marco P. Schoen

Data 2025, 10(10), 163; https://doi.org/10.3390/data10100163 - 14 Oct 2025

Abstract

Hollow heart, and other crop defects, can be devastating to farmers. Hollow heart is not a disease but a physiological disorder affected by temperature, soil moisture, plant density, and other factors. These defects can cause substantial annual losses for farmers. Currently, potatoes are shipped and inspected from producers to shipping points and markets. At these facilities, samples are inspected for defects. Detection of hollow heart consists of halving potatoes and visually inspecting for defects. The defect size is compared to USDA hollow heart classification charts for acceptance or rejection. An automatic, non-destructive system to identify hollow heart has the potential to improve quality. Two methods have been developed to collect data for such a system: acoustic signal capture and visual/vibration signal capture. Data is collected and stored for one potato at a time. The procedure includes the collection of weight, proportional size, and volume, as well as the generation of an acoustic sound signal through a drop test and a motion signal captured through a vision system. To simulate hollow heart, potatoes are cored and retested by producing a new set of data. Each potato is manually cut and inspected for true hollow heart. The generated data includes over 1000 samples, each comprising proportional volume, weight, proportional size, motion, and acoustic data. Such a dataset does not exist in the current literature and can serve for the development of machine learning algorithms to detect hollow heart nondestructively. In this paper, the data is also analyzed in terms of its statistical properties, as applied for possible feature engineering in machine learning. Full article

12 pages, 1191 KB

Open Access Data Descriptor

University Student Dropout: A Longitudinal Dataset of Demographic, Socioeconomic, and Academic Indicators

by Arnau Igualde-Sáez, José P. Garcia-Sabater, Juan A. Marin-Garcia, Sergio Puche García, Carlos Turró, Ignacio Despujol, Marina Alonso, José V. Benlloch-Dualde, Pedro Pablo Soriano Jiménez and Julien Maheut

Data 2025, 10(10), 162; https://doi.org/10.3390/data10100162 - 14 Oct 2025

Abstract

This dataset contains detailed information on student trajectories and dropout factors at a Spanish technological university offering Science, Technology, Engineering, Arts, and Mathematics programs. The data comprise demographic, socioeconomic, and academic variables for all enrolled students, including those in bachelor’s, master’s, doctoral, and lifelong learning programs, across three complete academic years, excluding periods affected by the SARS-CoV-2 pandemic. The data were collected and standardized from disjointed internal data sources, and fully anonymized. The dataset contains information about 39,364 students, 4989 courses in 163 degrees, and 77 variables related to admission pathways, academic performance indicators, socio-demographic background, digital activity in the Learning Management System, and Wi-Fi access records. Each of the 464,739 records corresponds to a course enrolment per student per year, enabling longitudinal analyses of academic progression and dropout. This data has the potential to be reused to support research on factors influencing student retention, allow for the development of predictive models to identify students at risk of leaving their studies, and offer a resource for comparative studies in higher education. Full article

18 pages, 2882 KB

Open Access Feature Paper Article

A Preferences Corpus and Annotation Scheme for Human-Guided Alignment of Time-Series GPTs

by Ricardo A. Calix, Tyamo Okosun, Chenn Zhou and Hong Wang

Data 2025, 10(10), 161; https://doi.org/10.3390/data10100161 - 9 Oct 2025

Abstract

The process of time-series forecasting such as predicting trajectories of silicon content in blast furnaces is a difficult task. Most time-series approaches today focus on scalar-type MSE loss optimization. This optimization approach, while widely common, could benefit from the use of human expert or process-level preferences. In this paper, we introduce a novel alignment and fine-tuning approach that involves learning from a corpus of preferred and dis-preferred time-series prediction trajectories. Our contributions include (1) a preference annotation pipeline for time-series forecasts, (2) the application of Score-based Preference Optimization (SPO) to train decoder-only transformers from preferences, and (3) results showing improvements in forecast quality. The approach is validated on both proprietary blast furnace data and the UCI Appliances Energy dataset. The proposed preference corpus and training strategy offer a new option for fine-tuning sequence models in industrial settings. Full article

14 pages, 1530 KB

Open Access Article

Assessing Musculoskeletal Injury Risk in Hospital Healthcare Professionals During a Single Daily Patient-Handling Task

by Xiaoxu Ji, Thomaz Ahualli de Sanctis, Mahmoud Alwahkyan, Xin Gao, Jenna Miller and Sarah Thomas

Data 2025, 10(10), 160; https://doi.org/10.3390/data10100160 - 8 Oct 2025

Abstract

Background: Healthcare professionals are at significant risk of musculoskeletal injuries due to the physically demanding nature of patient-handling tasks. While various ergonomic interventions have been introduced to mitigate these risks, comprehensive methods for assessing and addressing musculoskeletal hazards remain limited. Purpose: This study presents a novel approach to evaluating musculoskeletal injury risks among healthcare workers, marking the first instance in which two motion tracking systems are used simultaneously. This dual-system setup enables a more comprehensive and dynamic analysis of worker interactions in real time. Healthcare professionals were divided into three groups to perform patient transfer tasks. Three key poses within the task, associated with peak lumbar forces, were identified and analyzed. Results: The resulting compressive forces on the participants’ lower back ranged from 581.0 N to 3589.1 N, and the Anterior–Posterior (A/P) shear forces ranged from 33.1 N to 912.3 N across the three poses. Relative differences in trunk flexion showed strong correlations with compressive and A/P shear forces at each pose, respectively. Discussion and conclusion: Strong associations were found between lumbar loads and participants’ anthropometrics. Recommendations for optimal postures and partner pairings were developed to help reduce the risk of lower back injuries during patient handling. Full article

9 pages, 616 KB

Open Access Article

Expected Shot Impact Timing (xSIT) and Other Advanced Metrics as Indicators of Performance in English Men’s and Women’s Professional Football

by Blanca De-la-Cruz-Torres, Miguel Navarro-Castro and Anselmo Ruiz-de-Alarcón-Quintero

Data 2025, 10(10), 159; https://doi.org/10.3390/data10100159 - 2 Oct 2025

Abstract

Background: Football performance analysis has grown rapidly in recent years, with increasing interest in advanced metrics to more accurately evaluate both individual and team performance. The aim of this study was to examine the utility of the Expected Shot Impact Timing (xSIT) metric as an indicator of shooting performance in English professional football, specifically in the men’s Premier League (PL) and the Women’s Super League (WSL). Methods: A total of 9831 shots from the PL (2015/16 season) and 3219 shots from the WSL (2020/21 season) were analyzed. Data were obtained from publicly accessible football databases. The variables examined included goals, Possession Value (PV), Expected Goals (xG), Expected Goals on Target (xGOT), and xSIT. All variables were normalized per match (90 min). Descriptive statistics, correlational analyses, and comparative analyses between leagues were performed. Results: The WSL exhibited a significantly higher PV than the PL (p < 0.001), whereas the remaining metrics showed no significant differences between leagues (p > 0.05). Moreover, in the WSL, all performance indicators displayed very strong correlations with goals, while in the PL, similarly strong associations were observed, except for PV, which showed only a weak relationship. Conclusions: The xSIT metric, as an indicator of shooting performance, may be regarded as an influential factor in determining match outcomes across both leagues. Full article

(This article belongs to the Special Issue Big Data and Data-Driven Research in Sports)

13 pages, 724 KB

Open Access Article

Research on the Development and Application of the GDELT Event Database

by Dengxi Hong, Zexin Fu, Xin Zhang and Yan Pan

Data 2025, 10(10), 158; https://doi.org/10.3390/data10100158 - 1 Oct 2025

Abstract

This study investigates the development and application of the GDELT (Global Database of Events, Language, and Tone) news database. Through experiments, we conducted a quantitative statistical analysis of the GDELT event database to evaluate its practical characteristics. The results indicate that although the database achieves comprehensive coverage across all countries and regions and includes most major global media outlets, the accuracy rate of its key fields is only approximately 55%, with data redundancy as high as 20%. Based on these findings, while the GDELT data demonstrates good coverage and data integrity, data correction and deduplication are recommended before its use in research contexts and industrial applications. Subsequently, a survey of the existing literature reveals that current studies using GDELT primarily focus on event-related metrics, such as event quantity, tone, and GoldsteinScale, for application in international relations analysis, crisis event prediction, policy effectiveness testing, and public opinion impact analysis. Nevertheless, news constitutes a fundamental channel of information dissemination in media networks, and the propagation of news events through these networks represents a critical area of study for information recommendation, public opinion guidance, and crisis intervention. Existing research has employed the Event, GKG, and Mentions tables to construct cross-national news flow network models. However, the informational correlations across different data table fields have not been fully leveraged in preliminary data selection, leading to substantial computational overhead. To advance research in this field, this study employs chained list queries on the Event and Mentions tables within GDELT. Using social network analysis, we constructed a media co-occurrence network of event reports, through which core hubs and associative relationships within the event dissemination network are identified. Full article
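
The idea of a media co-occurrence network can be sketched in a few lines. The sketch assumes that mention records have already been reduced to `(event_id, source)` pairs; that layout, the sample domains, and the function names are illustrative assumptions, not the paper's actual query or pipeline. Two sources are linked whenever they report the same event, and the edge weight counts the shared events.

```python
from collections import defaultdict
from itertools import combinations


def co_occurrence(mentions):
    """Build weighted edges between sources that reported the same event.

    `mentions` is an iterable of (event_id, source) pairs (assumed layout).
    Returns {(source_a, source_b): number_of_shared_events} with a < b.
    """
    by_event = defaultdict(set)
    for event_id, source in mentions:
        by_event[event_id].add(source)  # a set ignores duplicate mentions

    weights = defaultdict(int)
    for sources in by_event.values():
        for a, b in combinations(sorted(sources), 2):
            weights[(a, b)] += 1
    return dict(weights)


def weighted_degree(edges):
    """Sum edge weights per source; high values point to candidate hubs."""
    degree = defaultdict(int)
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return dict(degree)


# Hypothetical mention records, including one duplicate.
mentions = [
    (1, "a.example"), (1, "b.example"), (1, "a.example"),
    (2, "a.example"), (2, "b.example"), (2, "c.example"),
    (3, "c.example"),
]
edges = co_occurrence(mentions)
hubs = weighted_degree(edges)
```

Collapsing duplicate mentions before pairing matters here, given the redundancy the abstract reports; without it, repeated rows would inflate edge weights.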

10 pages, 2446 KB

Open Access Data Descriptor

A Multi-Class Labeled Ionospheric Dataset for Machine Learning Anomaly Detection

by Aleksandra Kolarski, Filip Arnaut, Sreten Jevremović, Zoran R. Mijić and Vladimir A. Srećković

Data 2025, 10(10), 157; https://doi.org/10.3390/data10100157 - 30 Sep 2025

Abstract

The binary anomaly detection (classification) of ionospheric data related to Very Low Frequency (VLF) signal amplitude in prior research demonstrated the potential for development and further advancement. Further data quality improvement is integral for advancing the development of machine learning (ML)-based ionospheric data (VLF signal amplitude) anomaly detection. This paper presents the transition from binary to multi-class classification of ionospheric signal amplitude datasets. The dataset comprises 19 transmitter–receiver pairs and 383,041 manually labeled amplitude instances. The target variable was reclassified from a binary classification (normal and anomalous data points) to a six-class classification that distinguishes between daytime undisturbed signals, nighttime signals, solar flare effects, instrument errors, instrumental noise, and outlier data points. Furthermore, in addition to the dataset, we developed a freely accessible web-based tool designed to facilitate the conversion of MATLAB data files to TRAINSET-compatible formats, thereby establishing a completely free and open data pipeline from the WALDO world data repository to data labeling software. This novel dataset facilitates further research in ionospheric signal amplitude anomaly detection, concentrating on effective and efficient anomaly detection in ionospheric signal amplitude data. The potential outcomes of employing anomaly detection techniques on ionospheric signal amplitude data may be extended to other space weather parameters in the future, such as ELF/LF datasets and other relevant datasets. Full article

(This article belongs to the Section Spatial Data Science and Digital Earth)

20 pages, 970 KB

Open Access Article

Automated Test Generation Using Large Language Models

by Marcin Andrzejewski, Nina Dubicka, Jędrzej Podolak, Marek Kowal and Jakub Siłka

Data 2025, 10(10), 156; https://doi.org/10.3390/data10100156 - 30 Sep 2025

Abstract

This study explores the potential of generative AI, specifically Large Language Models (LLMs), in automating unit test generation in Python 3.13. We analyze tests, both those created by programmers and those generated by LLMs, for fifty source code cases. Our main focus is on how the choice of model, the difficulty of the source code, and the prompting strategy influence the quality of the generated tests. The results show that AI models can help automate test creation for simple code, but their effectiveness decreases for more complex tasks. We introduce an embedding-based similarity analysis to assess how closely AI-generated tests resemble human-written ones, revealing that AI outputs often lack semantic diversity. The study also highlights the potential of AI models for rapid test prototyping, which can significantly speed up the software development cycle. However, further customization and training of the models on specific use cases is needed to achieve greater precision. Our findings provide practical insights into integrating LLMs into software testing workflows and emphasize the importance of prompt design and model selection. Full article
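
An embedding-based similarity analysis of this kind typically compares vectors with cosine similarity. The abstract names neither the embedding model nor the similarity measure, so the following is only a sketch under that assumption; the vectors are made-up placeholders standing in for embeddings of a human-written and an AI-generated test.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Placeholder embeddings (hypothetical values, not model output).
human_test = [0.2, 0.7, 0.1]
ai_test = [0.25, 0.65, 0.05]
score = cosine_similarity(human_test, ai_test)
```

Scores near 1 for most pairs of generated tests would be one way to observe the low semantic diversity the abstract mentions.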

12 pages, 1732 KB

Open Access Data Descriptor

A Dataset of Environmental Toxins for Water Monitoring in Coastal Waters of Southern Centre, Vietnam: Case of Nha Trang Bay

by Hoang Xuan Ben, Tran Cong Thinh and Phan Minh-Thu

Data 2025, 10(10), 155; https://doi.org/10.3390/data10100155 - 29 Sep 2025

Abstract

This study presents a comprehensive dataset developed to monitor coastal water quality in the south-central region of Vietnam, focusing on Nha Trang Bay. Environmental data were collected from four research cruises conducted between 2013 and 2024. Water samples were taken at two depths: surface samples at approximately 0.5–1.0 m below the water surface, and bottom samples 1.0 to 2.0 m above the seabed, depending on site-specific bathymetry. These samples were analyzed for key water quality parameters, including biological oxygen demand (BOD₅), dissolved inorganic nitrogen (DIN), dissolved inorganic phosphorus (DIP), and Chlorophyll-a (Chl-a). The data establish a valuable baseline for assessing both spatial and temporal patterns of water quality, and for calculating a eutrophication index to evaluate potential environmental degradation. Importantly, the dataset also has practical applications for environmental management. It can support assessments of how seasonal tourism peaks contribute to nutrient enrichment, how aquaculture expansion affects dissolved oxygen dynamics, and how water quality trends evolve under increasing anthropogenic pressure. These applications make it a useful resource for evaluating pollution control efforts and for guiding sustainable development in coastal areas. By promoting open access, the dataset not only supports scientific research but also strengthens evidence-based management strategies to protect ecosystem health and socio-economic resilience in Nha Trang Bay. Full article

18 pages, 1089 KB

Open Access Data Descriptor

Digital Accessibility of Solar Energy Variability Through Short-Term Measurements: Data Descriptor

by Fernando Venâncio Mucomole, Carlos Augusto Santos Silva and Lourenço Lázaro Magaia

Data 2025, 10(10), 154; https://doi.org/10.3390/data10100154 - 28 Sep 2025

Abstract

A variety of factors, such as absorption, reflection, and attenuation by atmospheric elements, influence the quantity of solar energy that reaches the surface of the Earth. This, in turn, impacts photovoltaic (PV) power generation. In light of this, a digital assessment of solar energy variability through short-term measurements was conducted to enhance PV power output. The clear-sky index ($K_{t}^{*}$) methodology was employed, effectively eliminating any indications of solar energy obstruction and comparing the measured radiation to the theoretical clear-sky radiation. The solar energy data were gathered in Mozambique, specifically in the southern region at Maputo–1, Massangena, Ndindiza, and Pembe, in the mid-region at Chipera, Nhamadzi, Barue–1, and Barue–2, as well as in the northern region at Nipepe-1, Nipepe-2, Nanhupo-1, Nanhupo-2, and Chomba, over the period from 2005 to 2024, with measurement intervals ranging from 1 to 10 min and 1 h during the measurement campaigns conducted by FUNAE and INAM, with additional data sourced from the PVGIS, Meteonorm, NOAA, and NASA solar databases. The analysis indicates a $K_{t}^{*}$ value with a density approaching 1 for clear days, while intermediate-sky days exhibit characteristics that lie between those of clear and cloudy days. It can be inferred that there exists a robust correlation among sky types, with values ranging from 0.95 to 0.89 per station, alongside correlated energies, which experience a regression with coefficients between 0.79 and 0.95. Based on the analysis of the sample, the region demonstrates significant potential for solar energy utilization, and similar sampling methodologies can be applied in other locations to optimize PV output and other solar energy projects.
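As a hedged illustration of the comparison this abstract describes, the clear-sky index is commonly written as the ratio of measured irradiance to modelled clear-sky irradiance. The symbols $G$ (measured global irradiance) and $G_{\mathrm{cs}}$ (theoretical clear-sky irradiance) are assumptions introduced here for illustration and are not notation taken from the article.

```latex
K_{t}^{*} = \frac{G}{G_{\mathrm{cs}}}
```

Under this definition, values of $K_{t}^{*}$ close to 1 correspond to clear days, which is consistent with the behaviour the abstract reports.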

(This article belongs to the Topic Smart Energy Systems, 2nd Edition)

17 pages, 3841 KB

Open Access Article

Sliding Performance Evaluation with Machine Learning-Based Trajectory Analysis for Skeleton

by Ting Yu, Zhen Peng, Zining Wang, Weiya Chen and Bo Huo

Data 2025, 10(10), 153; https://doi.org/10.3390/data10100153 - 24 Sep 2025

Abstract

Skeleton is an extreme sliding sport in the Winter Olympics, where formulating targeted sliding strategies, based on training videos to navigate complex tracks, is particularly important. To make in-depth use of training video records, this study proposes an analytical method based on Mixture of Gaussians (MoG) and K-means clustering to extract and analyze trajectories from recorded videos for sliding performance evaluation and strategy development. A case study was conducted using data from the Chinese national skeleton team at the Yanqing Sliding Center, obtaining 741, 834, and 726 sliding trajectories from three representative curves. These trajectories were divided into groups based on sliding completion time (fast, medium, and slow groups). The consistency of trajectories within each group was calculated to evaluate sliding stability, while trajectory patterns in the fast group were clustered and described based on the average values of multiple features (starting position, ending position, and apex orthogonal offset). The results showed that more skilled athletes exhibited greater sliding stability (lower $\rho_{C}$-values), and on each curve, there were sliding patterns that performed significantly better than others. This research quantifies the characteristics of athletes’ sliding trajectories on curves, facilitating the visual tracking of training effects and the development of personalized strategies. It provides coaches and athletes with scientific decision-making support and clear directions for improvement, ultimately enabling precise enhancements in training efficiency and competitive performance, while also laying a technical foundation for the future development of intelligent training systems.

(This article belongs to the Special Issue Big Data and Data-Driven Research in Sports)

10 pages, 2476 KB

Open Access Data Descriptor

In Situ Monitoring and Bioluminescence Kinetics of Pseudomonas fluorescens M3A Bioluminescent Reporter with Bacteriophage ΦS1

by Phillip R. Myer, Pankaj Bhatt, Halis Simsek and Bruce M. Applegate

Data 2025, 10(10), 152; https://doi.org/10.3390/data10100152 - 23 Sep 2025

Abstract

Food spoilage and the associated organisms are a continuing concern for the food industry. The microorganisms involved with food spoilage in pasteurized milk can be introduced in a variety of ways, which include those that survive pasteurization and/or are introduced post-pasteurization. The use of bacteriophages for therapeutic regimens and as a method for the biocontrol of food-borne pathogens has been widely studied and applied; however, their use in the biocontrol of spoilage organisms is still nascent. Bioluminescent bacteria offer the ability to act as cell-death reporters. In the case of using bacteriophage against spoilage-associated bacteria, cell death results in the loss of bioluminescence. In this study, a bioluminescent Pseudomonas species, Pseudomonas fluorescens M3A, was used to monitor the efficacy of the bacteriophage-associated biocontrol system within laboratory bacterial growth broth and fluid milk using bacteriophage ΦS1. In a bioluminescence kinetic assay with ten-fold serially diluted P. fluorescens M3A and bacteriophage ΦS1, the data demonstrated rapid inactivation of bacterial growth, even at low bacteriophage titers. Cell death was indicated by the loss of bacterial bioluminescence. These data help to support the application of bacteriophage-based technologies against spoilage-associated bacteria to prolong shelf life in the event of microbial growth.

24 pages, 4286 KB

Open Access Article

Validation of Anthropogenic Emission Inventories in Japan: A WRF-Chem Comparison of PM₂.₅, SO₂, NO_x and CO Against Observations

by Kenichi Tatsumi and Nguyen Thi Hong Diep

Data 2025, 10(9), 151; https://doi.org/10.3390/data10090151 - 22 Sep 2025

Abstract

Reliable, high-resolution emission inventories are essential for accurately simulating air quality and for designing evidence-based mitigation policies. Yet their performance over Japan, where transboundary inflow, strict fuel regulations, and complex source mixes coexist, remains poorly quantified. This study therefore benchmarks four widely used anthropogenic inventories (REAS v3.2.1, CAMS-GLOB-ANT v6.2, ECLIPSE v6b, and HTAP v3) by coupling each to WRF-Chem (10 km grid) and comparing simulated surface PM₂.₅, SO₂, CO, and NO_x with observations from >900 stations across eight Japanese regions for the years 2010 and 2015. All simulations shared identical meteorology, chemistry, and natural-source inputs (MEGAN 2.1 biogenic VOCs; FINN v1.5 biomass burning) so that differences in model output isolate the influence of anthropogenic emissions. HTAP delivered the most balanced SO₂ and CO fields (regional mean biases mostly within ±25%), whereas ECLIPSE reproduced NO_x spatial gradients best, albeit with a negative overall bias. REAS captured industrial SO₂ reliably but over-estimated PM₂.₅ and NO_x in western conurbations while under-estimating them in rural prefectures. CAMS-GLOB-ANT showed systematic biases, under-estimating PM₂.₅ and CO yet markedly over-estimating SO₂, highlighting the need for Japan-specific sulfur-fuel adjustments. For several pollutant–region combinations, absolute errors exceeded 100%, confirming that emissions uncertainty, not model physics, dominates regional air quality error even under identical dynamical and chemical settings. These findings underscore the importance of inventory-specific and pollutant-specific selection (or better, multi-inventory ensemble approaches) when assessing Japanese air quality and formulating policy. Routine assimilation of ground and satellite data, together with inverse modeling, is recommended to narrow residual biases and improve future inventories.
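The percentage biases quoted in this abstract can be made concrete with a small sketch. The abstract does not state which bias metric the authors used; the normalized mean bias (NMB) below is one common choice in model evaluation and is assumed here purely for illustration. The function name and the sample values are hypothetical and are not taken from the article.

```python
def normalized_mean_bias(simulated, observed):
    """Return the normalized mean bias in percent.

    NMB = 100 * sum(sim - obs) / sum(obs). Positive values mean the
    model over-estimates on average; negative values mean it
    under-estimates. This metric is an assumption, not the article's
    stated method.
    """
    if len(simulated) != len(observed):
        raise ValueError("simulated and observed must have the same length")
    total_observed = sum(observed)
    if total_observed == 0:
        raise ValueError("sum of observations must be non-zero")
    total_difference = sum(s - o for s, o in zip(simulated, observed))
    return 100.0 * total_difference / total_observed


# Hypothetical paired concentrations for one region (arbitrary units).
sim = [12.0, 18.0, 30.0]
obs = [10.0, 20.0, 20.0]
print(normalized_mean_bias(sim, obs))
```

With such a metric, a regional result "within ±25%" would mean the returned value lies between -25 and 25.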
