Data, Volume 11, Issue 2 (February 2026) – 20 articles

Cover Story: The image shows integrated vapour transport (IVT) of a high-impact atmospheric river (AR) making landfall over the Pacific Northwest at 0000 UTC on 10 December 2025. ARs are narrow corridors of intense moisture transport that can trigger extreme precipitation, flooding, gusty winds, and rapid temperature changes. This paper introduces S-EDARA, a supplementary dataset to the ERA5-based Dataset for Atmospheric River Analysis (EDARA). S-EDARA incorporates AR shapes detected by the tARget version 4 algorithm, strong IVT indicators, and pseudo total precipitation rate fields derived from moisture convergence. Interactive graphical catalogues enable comprehensive visualization of AR-related hazards at 6-hourly intervals from 1940 to present. Together, EDARA and S-EDARA offer powerful tools to advance understanding of AR characteristics, impacts, and trends on a global scale.
14 pages, 307 KB  
Data Descriptor
Dataset on Suicide Risk, Substance Abuse, and Family Functioning Among University Students in Cali, Colombia
by Naydu Acosta-Ramírez, Jorge Mario Angulo-Mosquera and Alejandro Botero-Carvajal
Data 2026, 11(2), 43; https://doi.org/10.3390/data11020043 - 23 Feb 2026
Viewed by 513
Abstract
Globally, one in eight people experiences a mental disorder; mental disorders are a leading cause of years lived with disability and disproportionately affect young people. Gaps in scientific knowledge have been identified, with few studies of university students. This article presents an open-access database on mental health and family functioning, collected through a survey of undergraduate students in health sciences programs at a private university in Cali (Colombia). The purpose was to explore suicide risk, substance abuse, and family functioning using three structured questionnaires (Family APGAR, DAST-10, and PANSI), together with sociodemographic variables, organized in four sections (family and peer support, substance use, suicidal ideation, and background). The results of the article consist of the database description; the final database includes 574 records obtained from students of health sciences programs (medicine, dentistry, psychology, prehospital care, nursing, dental mechanics). The data are provided as raw, analyzable files (spreadsheet formats) free of charge from Mendeley Data. The scientific impact of these data lies in their potential to be reused by researchers and higher-education decision-makers for secondary analyses that guide the development of mental and family health interventions for groups linked to undergraduate programs in the health sector.

18 pages, 4591 KB  
Data Descriptor
Individual-Level Behavioral Dataset Linking Trace Eyeblink Conditioning, Contextual Fear Memory, and Home-Cage Activities in rTg4510 and Wild-Type Mice with Doxycycline Treatment
by Ryo Kachi, Takuma Nishijo and Yasushi Kishimoto
Data 2026, 11(2), 42; https://doi.org/10.3390/data11020042 - 16 Feb 2026
Viewed by 406
Abstract
This dataset provides synchronized multimodal behavioral measurements from 36 mice across four experimental groups: wild-type and rTg4510 tauopathy mice, each tested with or without doxycycline-mediated suppression of mutant tau expression. Of these, 34 mice had complete measurements across all three behavioral paradigms and were used for analyses requiring full cross-task linkage. At six months of age, all animals underwent three standardized behavioral paradigms: home cage monitoring, ten-day trace eyeblink conditioning, and contextual fear conditioning. The individual-level data include locomotor activity, rearing duration, conditioned response metrics, eyelid closure latencies, and contextual freezing percentages. All measurements are linked using unique mouse identifiers, enabling cross-task analysis without preprocessing or imputation. The dataset is accompanied by a complete data dictionary, a processing workflow diagram, and validation analyses demonstrating cross-paradigm correlations. The main figures illustrate the cross-task associations, along with additional early-phase acquisition and temporal processing correlations. Provided in an open CSV format with detailed metadata, this resource supports behavioral phenotyping, machine learning applications, and the investigation of learning mechanisms in tauopathy models.
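The linkage by unique mouse identifier described above can be sketched as follows. This is a minimal illustration only: the column names (`mouse_id`, `cr_percent`, `freezing_percent`), the identifiers, and the values are hypothetical and are not taken from the published files or their data dictionary.

```python
import csv
import io

# Hypothetical per-task tables; column names and values are illustrative only.
EYEBLINK_CSV = """mouse_id,cr_percent
M01,62.5
M02,40.0
M03,55.0
"""
FEAR_CSV = """mouse_id,freezing_percent
M01,48.0
M03,30.5
"""


def read_by_id(text, key="mouse_id"):
    """Parse CSV text into a dict mapping each identifier to its row."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def link_complete(*tables):
    """Keep only identifiers present in every table and merge their columns."""
    shared = set(tables[0]).intersection(*tables[1:])
    linked = {}
    for mouse_id in sorted(shared):
        merged = {}
        for table in tables:
            merged.update(table[mouse_id])
        linked[mouse_id] = merged
    return linked


linked = link_complete(read_by_id(EYEBLINK_CSV), read_by_id(FEAR_CSV))
```

Restricting the join to identifiers present in every table mirrors the idea of analysing only animals with complete measurements across paradigms; an animal missing from one table (here the hypothetical `M02`) is dropped rather than imputed.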

12 pages, 561 KB  
Data Descriptor
Perceptions of Security, Victimization, and Coexistence: A Database from Cali, Colombia
by Jhon James Mora, Enrique Javier Burbano-Valencia, Angie Mondragón-Mayo and José Santiago Arroyo Mina
Data 2026, 11(2), 41; https://doi.org/10.3390/data11020041 - 14 Feb 2026
Cited by 1 | Viewed by 658
Abstract
This article addresses a key evidence gap in urban safety policy in Colombia: the absence of publicly accessible microdata that jointly measure victimization, perception of security, and probability of sanctions among socioeconomically vulnerable residents. It aims to provide a clean, linkable dataset that enables analysis of variations in these issues across demographic and territorial groups in Cali (recently classified as the 29th most dangerous city worldwide, with 1028 and 1065 homicides in 2024 and 2025, respectively). It reports face-to-face survey data collected from 22 July to 16 August 2024 at Sistema de Identificación de Potenciales Beneficiarios de Programas Sociales (SISBEN) service points. The final dataset includes 2139 adults (aged 18–95 years) and combines (i) primary responses on perceived safety (e.g., public space safety and surveillance cameras), perceived likelihood of sanction, victimization, and self-protection measures with (ii) selected sociodemographic and household characteristics drawn from SISBEN IV records. Individual-level linkage was implemented using respondent identification at interviews, yielding an integrated anonymized file suitable for replication and secondary analysis. The dataset enables distributive analyses of insecurity (e.g., by sex, age, and ethnicity, including Afro-descendant populations) within a policy-relevant target group and supports evaluation and targeting of local interventions by providing individual-level indicators.

13 pages, 2157 KB  
Data Descriptor
Georeferenced Snow Depth and Snow Water Equivalent Dataset (2025) from East Kazakhstan Region
by Dmitry Chernykh, Roman Biryukov, Lilia Lubenets, Andrey Bondarovich, Nurassyl Zhomartkan, Almasbek Maulit, Dauren Nurekenov, Kamilla Rakhymbek, Yerzhan Baiburin and Aliya Nugumanova
Data 2026, 11(2), 40; https://doi.org/10.3390/data11020040 - 13 Feb 2026
Viewed by 583
Abstract
In this work, we present the Snow Depth and Snow Water Equivalent Dataset for specific areas of the East Kazakhstan Region, which can be used to monitor and understand water resource dynamics in mountain regions. The dataset is a georeferenced collection of snow depth, snow density, and derived snow water equivalent (SWE) measurements obtained through manual snow surveys. The surveys were conducted during field campaigns in the period of maximum snow accumulation, from 27 February to 6 March 2025. Snow survey sites were selected to maximize coverage of diverse landscape settings and snow accumulation conditions. In total, 111 snow survey sites were established across the East Kazakhstan Region, and 2331 snow depth measurements and 555 snow density measurements were collected. In post-field (laboratory) processing, SWE was calculated for all snow survey sites based on measured snow depth and snow density values.
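The abstract does not give the exact procedure used to derive SWE, so the following is only a sketch of the standard relation, SWE = snow depth × (snow density / water density), assuming depth in centimetres and density in kg/m³. The function name and the example values are illustrative and are not taken from the dataset.

```python
WATER_DENSITY_KG_M3 = 1000.0  # density of liquid water


def swe_mm(depth_cm: float, density_kg_m3: float) -> float:
    """Snow water equivalent in millimetres of water.

    Standard relation: SWE = depth * (snow density / water density).
    Units are an assumption of this sketch: depth in cm, density in kg/m^3.
    """
    depth_m = depth_cm / 100.0
    return depth_m * density_kg_m3 / WATER_DENSITY_KG_M3 * 1000.0


# Illustrative values only: a 50 cm snowpack at 250 kg/m^3.
site_swe = swe_mm(50.0, 250.0)  # 125.0 mm of water
```

With depth expressed in metres and density in kg/m³, the SWE in millimetres is numerically the product of the two, which gives a quick sanity check on tabulated values.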

10 pages, 1011 KB  
Article
The Role of Shot Velocity in Advanced Post-Shot Metrics: Evidence from the UEFA European Football Championships
by Blanca De-la-Cruz-Torres, Anselmo Ruiz-de-Alarcón-Quintero and Miguel Navarro-Castro
Data 2026, 11(2), 39; https://doi.org/10.3390/data11020039 - 13 Feb 2026
Viewed by 595
Abstract
Introduction: Ball velocity is a critical determinant of shot effectiveness in football, yet its influence on advanced post-shot metrics, such as expected shot impact timing (xSIT) and expected goals on target (xGOT), remains poorly understood, particularly in the context of sex-specific differences. This study examined the relationship between ball velocity and these metrics in men’s and women’s elite European tournaments. Methods: A total of 2174 shots were analyzed from all matches of the 2024 UEFA Men’s EURO (n = 1305) and 2025 UEFA Women’s EURO (n = 869), classified as goal shots on target, non-goal shots on target, and shots off target. Ball velocity was measured for each shot, and its associations with xSIT, our own xGOT model and the StatsBomb xGOT model were quantified using correlation coefficients. Results: Ball velocity differed significantly between sexes (p < 0.001), with higher values in men, and goal shots on target exhibited lower velocities than non-goal or off-target shots, indicating a speed–accuracy trade-off. Only xSIT and our own xGOT model were sensitive to ball velocity, reflecting sex-specific differences (p < 0.001). When comparing shot types across advanced metrics, a consistent trend was observed in both tournaments: xSIT showed no significant differences between goal and non-goal shots, whereas both xGOT models were higher for goal shots on target. Correlations indicated a moderate positive relationship between xSIT and ball velocity, and moderate negative correlations for both xGOT models, slightly stronger in men. Conclusions: Ball velocity is a critical factor influencing shot performance and advanced post-shot metrics, with notable sex-specific differences.
(This article belongs to the Special Issue Big Data and Data-Driven Research in Sports)

18 pages, 6606 KB  
Data Descriptor
Annotated IoT Dataset of Waste Collection Events
by Peter Tarábek, Andrej Michalek, Roman Hriník, Ľubomír Králik and Karol Decsi
Data 2026, 11(2), 38; https://doi.org/10.3390/data11020038 - 11 Feb 2026
Viewed by 536
Abstract
This work presents a curated dataset of multimodal sensor measurements from Internet of Things (IoT) units mounted on waste collection vehicles. Each unit records multiple data streams including GPS position, vehicle velocity, radar-based container presence, accelerometer readings of the lifting arm, and RFID tag identifiers of the bins. The dataset provides two complementary forms of annotation: (1) algorithmically generated events that were manually cleaned through visual inspection of sensor signals, offering large-scale coverage across 5 vehicles over a total of 25 collection days, and (2) manually validated events derived from synchronized video recordings, representing ground truth for 3 vehicles over 8 collection days. In total, the dataset contains 12,391 annotated waste collection events. The dataset spans diverse operational conditions with varying container sizes and includes both RFID-equipped and non-RFID bins. It can be used to train and evaluate machine learning models for event detection, anomaly recognition, or explainability studies, and to support practical applications such as Pay-as-you-throw (PAYT) waste management schemes. By combining multimodal sensor signals with reliable annotations, the dataset represents a unique resource for advancing research in smart waste collection and the broader field of IoT-enabled urban services.
(This article belongs to the Section Information Systems and Data Management)

25 pages, 18392 KB  
Data Descriptor
A Century of Migration (1830–1939): 735,000 Enriched Records from Bremen’s Ship Passenger Lists
by Tobias Perschl, Pauline Schmidt, Sebastian Gassner and Malte Rehbein
Data 2026, 11(2), 37; https://doi.org/10.3390/data11020037 - 10 Feb 2026
Viewed by 969
Abstract
This paper publishes 735,000 historical passenger entries from the German North Sea port of Bremen, created between 1830 and 1939, and now structured, enriched, and processed into a research-ready database. It provides an overview of the original archival documents and their datafication, beginning with a historical account of why the passenger lists were created and which information they recorded. The lists were transcribed manually, largely by a team of volunteer transcribers with expertise in family history and genealogy, and were first made available online in 2003. To enhance their analytical value, we computationally post-processed these data through (1) data cleaning, especially addressing spelling variants and transcription errors; (2) data normalisation, including conversion into standardised formats; and (3) data augmentation by adding identifiers, geographic information, and multiple classifications. Finally, we discuss limitations of the resulting dataset as well as its analytical potential.

21 pages, 12506 KB  
Data Descriptor
S-EDARA: An Atmospheric River Dataset Supplement to EDARA for Impact Assessment
by Ruping Mo
Data 2026, 11(2), 36; https://doi.org/10.3390/data11020036 - 10 Feb 2026
Viewed by 644
Abstract
Atmospheric rivers (ARs) play a critical role in producing high-impact weather events, including extreme precipitation, flooding, gusty winds, and rapid temperature changes. Building upon the recently published EDARA (ERA5-based Dataset for Atmospheric River Analysis), we present S-EDARA, a supplementary dataset that enhances AR impact assessment capabilities through a newer AR detection algorithm and additional impact-related metrics. S-EDARA includes AR shapes identified by the tARget version 4 (ARS4) algorithm, strong integrated vapour transport (SIVT) indicators, and pseudo total precipitation rate (PTPR) fields. The dataset features both numerical data and interactive graphical catalogues displaying ARS4, SIVT, PTPR, gusty winds, and 24 h temperature changes at 6-hourly intervals. These enhancements enable more comprehensive analysis of AR impacts and characteristics, particularly for regions experiencing rapidly changing meteorological conditions during AR events. The dataset covers the period from 1940 to the present and is publicly available through the Federated Research Data Repository.
(This article belongs to the Section Spatial Data Science and Digital Earth)

14 pages, 717 KB  
Data Descriptor
In Situ Crop and Soil Data and UAV Imagery from Winter Wheat Fields in a Bulgarian Site
by Petar Dimitrov, Eugenia Roumenina, Georgi Jelev, Lachezar Filchev, Alexander Gikov, Ilina Kamenova, Iliana Ilieva, Dessislava Ganeva, Milena Kercheva, Martin Banov, Veneta Krasteva, Viktor Kolchakov, Emil Dimitrov and Nevena Miteva
Data 2026, 11(2), 35; https://doi.org/10.3390/data11020035 - 7 Feb 2026
Viewed by 615
Abstract
This data descriptor presents a dataset comprising crop and soil parameters measured in winter wheat fields near the town of Knezha, Bulgaria. The data were collected as part of a project evaluating the potential of vegetation indices derived from Sentinel-2 satellite imagery to predict biophysical and biochemical crop parameters. The core dataset consists of measurements obtained from 20 m × 20 m field plots and includes a broad range of parameters: leaf area index, fraction of absorbed photosynthetically active radiation, vegetation cover fraction, chlorophyll content, above-ground biomass, plant nitrogen content, biological yield, surface soil moisture, spectral reflectance, plant density, crop height, visual assessments of disease or pest damage, and data on weed occurrence. The dataset is complemented by unmanned aerial vehicle imagery, crop calendars, and field management information. The main soil types in the study area were characterized through soil profiles, while meteorological data were obtained from an automated weather station. The data were collected during the 2016–2017 and 2017–2018 agricultural seasons. The dataset is freely available for download and serves as a resource for researchers in remote sensing, particularly for validating satellite-derived products, as well as for specialists involved in winter wheat monitoring, modeling, and agronomic studies.
(This article belongs to the Section Spatial Data Science and Digital Earth)

3 pages, 157 KB  
Data Descriptor
Normative Physical Fitness Profiles and Sex Differences in University Students of Sport Sciences: An Open Dataset of Anthropometrics, Flexibility, Strength, and Jump Performance
by Julio Martín-Ruiz and Laura Ruiz-Sanchis
Data 2026, 11(2), 34; https://doi.org/10.3390/data11020034 - 7 Feb 2026
Viewed by 541
Abstract
This Data Descriptor provides an open, anonymized dataset describing anthropometric and physical fitness outcomes in undergraduate students enrolled in a Physical Activity and Sport Sciences degree program. The dataset includes 156 participants (28 females and 128 males) and reports sex, age, body mass, stature, and body mass index, alongside standardized field-based tests covering flexibility, muscular endurance, strength, and jump performance. Hip flexibility was assessed using the Thomas test on both sides. Trunk extensor endurance was measured using the Biering–Sørensen test, and upper-body strength–endurance was assessed using a dead-hang test. Upper limb strength was recorded as elbow flexion strength. Lower limb power was evaluated using vertical jump tests, including Abalakov, squat jump, and countermovement jump, and a derived indicator (IE) is provided to facilitate comparisons across jump modalities. The data are distributed as a machine-readable CSV file accompanied by a detailed data dictionary describing the variables, units, and missingness. The dataset is intended to support the reproducible reporting of normative fitness profiles in sports science students, facilitate teaching and benchmarking in exercise science contexts, and enable secondary analyses exploring associations between anthropometry and physical performance. For reproducible inferential comparisons, users may apply Welch’s two-sample t-test for sex-based differences.
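As a minimal sketch of the Welch comparison mentioned above, the following computes the Welch t statistic and the Welch–Satterthwaite degrees of freedom using only the Python standard library. The function name and the two samples are made up for illustration; they are not values from the dataset.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-group sample variance divided by group size.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch–Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Illustrative numbers only (e.g. jump heights in cm for two groups).
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_a, group_b)
```

Welch's variant does not assume equal variances, which suits unbalanced groups such as the 28 versus 128 participants here. To obtain a p-value as well, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test if SciPy is available.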
(This article belongs to the Special Issue Big Data and Data-Driven Research in Sports)
17 pages, 1299 KB  
Data Descriptor
Synthetic and Encoded Database of Dengue, Zika, Chikungunya, and Influenza Derived from the Literature
by Elí Cruz-Parada, Guillermina Vivar-Estudillo, Laura Pérez-Campos Mayoral, María Teresa Hernández-Huerta, Alma Dolores Pérez-Santiago, Carlos Romero-Diaz, Eduardo Pérez-Campos Mayoral, Iván Antonio García-Montalvo, Lucia Martínez-Martínez, Héctor Martínez-Ruiz, Idarh Matadamas, Miriam Emily Avendaño-Villegas, Margarito Martínez Cruz, Hector Alejandro Cabrera-Fuentes, Aldo Eleazar Pérez-Ramos, Eduardo Lorenzo Pérez-Campos and Carlos Mauricio Lastre-Domínguez
Data 2026, 11(2), 33; https://doi.org/10.3390/data11020033 - 6 Feb 2026
Viewed by 817
Abstract
This work presents a synthetic binary database of Dengue, Zika, Chikungunya, and Influenza constructed entirely from clinical information extracted from the scientific literature. Because clinical records in medical units are scarce and heterogeneous, particularly for arboviral diseases, existing datasets are often insufficient for developing robust Machine Learning models. To address this limitation, an extensive search of PubMed and Google Scholar was conducted between February 2024 and May 2025, following strict selection criteria focused on diagnostic confirmation. The resulting dataset comprises 48,214 records and 67 standardized signs and symptoms, homogenized across all pathologies. Each record is fully binary, contains no missing values, and represents symptom presence or absence. The composition includes 22,379 Dengue records, 7135 Zika records, 7959 Chikungunya records, and 10,741 Influenza records. Symptom prevalence was analyzed, revealing consistency with patterns reported in epidemiological and clinical studies, supporting the dataset’s plausibility. This database enables statistical exploration and direct integration into Machine Learning pipelines without the need for imputation. It has been used in an in silico predictive study of arboviral diseases, employing Influenza as a negative control, and serves as a reproducible, literature-derived resource for computational modeling.

12 pages, 1148 KB  
Data Descriptor
Psoriatic Arthritis (PsA) Clinical Lipidomics Dataset with Hidden Laboratory Workflow Artifacts: A Benchmark Dataset for Data Processing Quality Control in Lipidomics
by Jörn Lötsch, Robert Gurke, Lisa Hahnefeld, Frank Behrens and Gerd Geisslinger
Data 2026, 11(2), 32; https://doi.org/10.3390/data11020032 - 3 Feb 2026
Viewed by 475
Abstract
This dataset presents a real-world lipidomics resource for developing and benchmarking quality control methods, batch effect detection algorithms, and data validation workflows. The data originate from a cross-sectional clinical study of psoriatic arthritis (PsA) patients (n = 81) and healthy controls (n = 26), matched for age, sex, and body mass index, collected at a tertiary university rheumatology center. Subtle laboratory irregularities were detected only through advanced unsupervised analysis, after the data had passed conventional quality control and standard analytical methods. Blood samples were processed using standardized protocols and analyzed using high-resolution and tandem mass spectrometry platforms. Both targeted and untargeted lipid assays captured lipids of several classes (including carnitines, ceramides, glycerophospholipids, sphingolipids, glycerolipids, fatty acids, sterols and esters, endocannabinoids). The dataset is organized into four comma-separated value (CSV) files: (1) Box–Cox-transformed and imputed lipidomics values; (2) outlier-cleaned and imputed values on the original scale; (3) metadata including clinical classifications, biological sex, and batch information for all assay types and control sample processing dates; and (4) a variable-level description file (readme.csv). The 292 lipid variables are named according to LIPID MAPS classification and standardized nomenclature. Complete batch documentation and FAIR-compliant data structure make this dataset valuable for testing the robustness of analytical pipelines and quality control in lipidomics and related omics fields. The dataset does not compete with larger lipidomics quality control datasets for comparisons of results; instead, it provides a real-life lipidomics dataset displaying traces of the laboratory sample processing schedule, which can be used to challenge quality control frameworks.

18 pages, 2474 KB  
Data Descriptor
An Integrated Environmental and Perceptual Dataset for Predicting Comfort in Smart Campuses During the Fall Semester
by Gianni Tumedei, Chiara Ceccarini, Giovanni Delnevo and Catia Prandi
Data 2026, 11(2), 31; https://doi.org/10.3390/data11020031 - 3 Feb 2026
Viewed by 530
Abstract
Indoor environmental comfort plays a central role in occupants’ well-being, learning outcomes, and productivity, especially in educational buildings characterized by high occupancy variability and diverse activities. This paper presents a real-world dataset collected at the Cesena Campus of the University of Bologna, aimed at supporting occupant-centric comfort analysis and prediction in classrooms and laboratories. The dataset integrates continuous environmental measurements, such as temperature, humidity, noise, air pressure, and CO2 concentration, with subjective comfort feedback gathered from students during regular lectures. Data were collected using permanently installed ceiling sensors and additional control sensors placed near occupants, enabling both longitudinal monitoring and validation analyses. Furthermore, the dataset includes both repeated comfort perception reports and a one-time comfort definition phase capturing individual relevance weights for different comfort dimensions. By combining objective and subjective data in realistic academic settings, the dataset provides a valuable resource for developing, benchmarking, and validating data-driven models for smart campus applications, indoor comfort prediction, and human-centered building analytics.

11 pages, 1038 KB  
Data Descriptor
Refined IDRiD: An Enhanced Dataset for Diabetic Retinopathy Segmentation with Expert-Validated Annotations and Comprehensive Anatomical Context
by Sakon Chankhachon, Supaporn Kansomkeat, Patama Bhurayanontachai and Sathit Intajag
Data 2026, 11(2), 30; https://doi.org/10.3390/data11020030 - 1 Feb 2026
Viewed by 883
Abstract
The Indian Diabetic Retinopathy Image Dataset (IDRiD) has been widely adopted for DR lesion segmentation research. However, it contains annotation gaps for proliferative DR lesions and labeling errors that limit its utility for comprehensive automated screening systems. We present Refined IDRiD, an enhanced version that addresses these limitations through (1) expert ophthalmologist validation and correction of labeling errors in original annotations for four non-proliferative lesions (microaneurysms, hemorrhages, hard exudates, cotton-wool spots), (2) the addition of three critical proliferative DR lesion annotations (neovascularization, vitreous hemorrhage, intraretinal microvascular abnormalities), and (3) the integration of comprehensive anatomical context (optic disc, fovea, blood vessels, retinal region). A team of three ophthalmologists (one senior specialist with >10 years’ experience, two expert fundus image annotators) conducted systematic annotation refinement, achieving an inter-rater agreement F1-score of 0.9012. The enhanced dataset comprises 81 high-resolution fundus images with pixel-level annotations for seven DR lesion types and four anatomical structures. All images were cropped to the retinal region of interest and resized to 1024 × 1024 pixels, with annotations stored as unified grayscale masks containing 12 classes enabling efficient multi-task learning. Refined IDRiD enables training of comprehensive DR screening systems capable of detecting both non-proliferative and proliferative stages while reducing false positives through anatomical context awareness.

18 pages, 1081 KB  
Data Descriptor
Controlled Generation of Synthetic Spanish Texts: A Dataset Using LLMs with and Without Contextual Retrieval
by José M. García-Campos, Agustín W. Lara-Romero, Vicente Mayor and Jorge Calvillo-Arbizu
Data 2026, 11(2), 29; https://doi.org/10.3390/data11020029 - 1 Feb 2026
Viewed by 679
Abstract
The increasing ability of Large Language Models (LLMs) to generate fluent and coherent text has heightened the need for resources to analyze and detect synthetic content, particularly in Spanish, where the scarcity of datasets hinders the development of reliable detection systems. This work presents a Spanish-language dataset of 18,236 synthetic news descriptions generated from real journalistic headlines using a fully reproducible, open-source pipeline. The methodology used to produce the dataset includes both a Retrieval Augmented Generation (RAG) approach, which incorporates contextual information from recent news descriptions, and a NO-RAG approach, which relies solely on the headline. Texts were generated with the instruction-tuned Mistral 7B Instruct model, systematically varying temperature to explore the effect of generation parameters. The dataset includes detailed metadata linking each synthetic description to its source headline, generation settings, and, when applicable, retrieved contextual content. By combining contextual grounding, controlled parameter variation, and source-level traceability, this dataset provides a reproducible and richly annotated resource that supports research in Spanish synthetic text and evaluation of LLM-based generation.

11 pages, 1353 KB  
Data Descriptor
Dual-Source Synthetic Uzbek Corpora for Sentiment Analysis and NER with Controlled Emoji Signals
by Bobur Saidov, Vladimir Barakhnin, Shohrux Madirimov, Umid Ibragimov, Shakhboz Meylikulov, Sultonbek Normamatov, Feruza Bahodirova, Javlonbek Matnazarov and Zarnigor Fayzullaeva
Data 2026, 11(2), 28; https://doi.org/10.3390/data11020028 - 1 Feb 2026
Cited by 1 | Viewed by 581
Abstract
This data descriptor presents two fully synthetic corpora for sentiment analysis and named entity recognition (NER) in Uzbek. The first corpus contains 12,000 hybrid synthetic sentences generated from templates with lexical randomization, automatic insertion of named entities (PER/ORG/LOC), lexicon-based polarity scoring, and a controlled emoji distribution. The second corpus includes 3000 “manual-style” sentences designed to resemble short, naturally structured messages. Although the manual-style subset was initially intended to be emoji-free, the released version includes a 39.6% emoji presence (sentences containing at least one emoji) to maintain comparability in emotional markers across corpora. Both corpora are released in CSV, XLSX, and JSONL formats and share a unified schema (id, text, sentiment, entities, entity_type, polarity_score, polarity_source, token_count, emojis, emoji_position, emoji_sentiment, conflict_flag, sentiment_from_polarity_score, split). The dataset is publicly available via Mendeley Data (DOI: 10.17632/y2d5pcyrzz.3).

10 pages, 1516 KB  
Data Descriptor
Multiplex Immunofluorescence and Histopathology Dataset of Cell Cycle–Related Proteins in Renal Cell Carcinoma
by Hazem Abdullah, In Hwa Um, Grant D. Stewart, Alexander Laird, Kathryn Kirkwood, Chang Wook Jeong, Cheol Kwak, Kyung Chul Moon, TranSORCE Team, Tim Eisen, Elena Frangou, Anne Warren, Angela Meade and David J. Harrison
Data 2026, 11(2), 27; https://doi.org/10.3390/data11020027 - 1 Feb 2026
Viewed by 740
Abstract
Clear-cell renal cell carcinoma (ccRCC) accounts for the majority of kidney cancer diagnoses and exhibits widely variable clinical behaviour. The dataset described here was generated to support the discovery of robust biomarkers of tumour cell-cycle arrest and to inform the risk-stratified management of ccRCC. We assembled four independent cohorts including 480 patients from the UK arm of the SORCE adjuvant trial, 300 patients from a surgically treated series in Korea, 120 patients from a retrospective Scottish cohort, and a paired primary–metastatic cohort comprising 62 patients. Formalin-fixed paraffin-embedded nephrectomy specimens were processed for routine hematoxylin and eosin (H&E) histology and for multiplex immunofluorescence (mIF). The mIF panels detect the cyclin-dependent kinase inhibitor p21CDKN1a, the DNA replication licensing factor MCM2, endoglin/CD105, Lamin B1 and nuclear DNA (Hoechst). Whole-slide images (WSIs) were acquired at high resolution, and artificial-intelligence pipelines were used to segment nuclei, classify individual cells into arrested phenotypes, and calculate the fraction of cells. Accompanying metadata include demographics, tumour stage, grade, Leibovich score, treatment arm (sorafenib/placebo), relapse events, and disease-free survival. All images and derived tables are released under a CC0 licence via the BioImage Archive, ensuring unrestricted reuse. This multi-cohort dataset provides a rich resource for studying cell-cycle arrest and proliferation markers, training image-analysis algorithms, and developing prognostic signatures in RCC.

21 pages, 8306 KB  
Article
100 m Resolution Age-Stratified Population Grid Data for China Based on Township-Level in 2020
by Chen Liang, Keting Xiao, Shuimei Fu, Xun Zhou, Xinxin Chen, Mengdie Yang, Jiale Cai, Wenhui Liu, Xinqin Peng, Fuliang Deng, Wei Liu, Mei Sun, Ying Yuan and Lanhui Li
Data 2026, 11(2), 26; https://doi.org/10.3390/data11020026 - 1 Feb 2026
Viewed by 715
Abstract
China’s age structure is undergoing profound demographic shifts, making accurate spatial information on age-stratified populations essential for policy-making, resource allocation, and risk assessment. However, census data are primarily aggregated by administrative units, offering coarse spatial resolution that constrains their integration and application with other gridded datasets. Using township-level population counts for four age groups (0–14, 15–59, 60–64, and ≥65 years) from the 2020 Seventh National Population Census across 38,572 townships, we developed an age-stratified downscaling framework. This framework integrates a random forest model with age-filtered Points of Interest (POI) data and other multi-source geospatial covariates to generate a 100 m resolution age-stratified population density weighting layer. Through dasymetric mapping of the township-level data, we produced the township-based 100 m Age-Stratified Population Grid Data (Township-ASPOP). Since township-level data represent the finest publicly available spatial unit of demographic statistics in China, we further validated the accuracy of Township-ASPOP by generating County-based 100 m Age-Stratified Population Grid Data (County-ASPOP) through dasymetric mapping using county-level age-stratified population data. The results demonstrate that County-ASPOP achieves superior predictive accuracy, with R² values of 0.95, 0.95, 0.85, and 0.86, and Root Mean Square Error (RMSE) values of 1743, 6829, 900, and 2033 persons per township for the four age groups, respectively, significantly outperforming the contemporaneous WorldPop dataset (R² = 0.69, 0.72, 0.64, and 0.60). The accuracy of Township-ASPOP is no less than that of County-ASPOP and effectively captures realistic spatial settlement patterns. This study establishes a reproducible framework for generating age-stratified population grid data and provides critical data support for policy formulation and resource allocation.
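The dasymetric step described in this abstract can be sketched as follows: each zone's census count is distributed across its grid cells in proportion to a weighting layer, so the cell values sum back to the census total. This is a minimal Python sketch under stated assumptions, not the authors' implementation; the function name, the array layout, and the uniform fallback for zones whose weights are all zero are illustrative choices, and in the paper the weights come from a random forest model.

```python
import numpy as np


def dasymetric_allocate(weights, zone_ids, zone_totals):
    """Distribute each zone's total over its cells in proportion to the cell weights.

    weights     : (H, W) non-negative weighting layer (e.g. predicted density)
    zone_ids    : (H, W) integer zone label per cell (e.g. township index)
    zone_totals : dict mapping zone label -> census count for one age group
    """
    weights = np.asarray(weights, dtype=float)
    zone_ids = np.asarray(zone_ids)
    grid = np.zeros_like(weights)
    for zone, total in zone_totals.items():
        cells = zone_ids == zone
        if not cells.any():
            continue  # zone has no cells on this grid
        w = weights[cells]
        w_sum = w.sum()
        if w_sum > 0:
            grid[cells] = total * w / w_sum
        else:
            # Assumed fallback: spread evenly when every weight in the zone is zero.
            grid[cells] = total / cells.sum()
    return grid


# Two hypothetical zones on a 2 x 2 grid.
weights = np.array([[1.0, 3.0], [0.0, 0.0]])
zones = np.array([[1, 1], [2, 2]])
grid = dasymetric_allocate(weights, zones, {1: 400, 2: 100})
```

Because the allocation is proportional within each zone, the gridded values preserve the zone totals exactly, which is the property that makes comparison against finer-scale census counts meaningful.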

29 pages, 5368 KB  
Data Descriptor
TGEconomicDataset: A Collection of Russian-Language Economic Telegram Channels and a Synthetic Data Generation Framework for Continuous Authentication
by Elena Luneva, Pavel Banokin and Alexander Shelupanov
Data 2026, 11(2), 25; https://doi.org/10.3390/data11020025 - 28 Jan 2026
Viewed by 1469
Abstract
Telegram, along with WhatsApp and Signal, has become very popular due to its hybrid capabilities, including both instant private and public messaging, making it an effective tool for quickly broadcasting content to a wide audience. This article presents TGEconomicDataset, a new dataset containing more than 2.9 million messages from the most popular Russian-language Telegram channels in the field of economics, as well as synthetically generated labeled mixtures of these channels. These mixtures are specifically designed to model authorship change scenarios for testing various methods for solving the problem of continuous authentication, which is of particular interest due to the need for organizations and companies to rely on data posted on social media. The presented dataset is enriched with quotes of important financial instruments such as gold futures, the USD/RUB currency pair, BRENT oil, the dollar index (DXY), and bitcoin (BTC), synchronized with the message timestamps. A detailed joint analysis of the collected data is provided. In addition to the presented dataset, we publish the scripts used to collect the data, integrate the financial indicators, and generate the synthetic mixtures for the continuous authentication task, ensuring full reproducibility of the research.
(This article belongs to the Section Information Systems and Data Management)

14 pages, 670 KB  
Data Descriptor
Face Typicality–Distinctiveness Norms for the 304 Front-View Faces of the Glasgow Unfamiliar Face Database
by Paulo Ventura, Francisco Cruz and Susana Araújo
Data 2026, 11(2), 24; https://doi.org/10.3390/data11020024 - 26 Jan 2026
Viewed by 628
Abstract
Face typicality and distinctiveness are key facial attributes that influence face recognition performance and the formation of social impressions. The present study aimed to provide normative data for these dimensions, offering a useful resource for face recognition research. Using a 7-point Likert scale, adult participants rated 304 front-facing faces from the Glasgow Unfamiliar Face Database (GUFD) for typicality–distinctiveness. Results indicated that the subjective rating method produced reliable estimates, with meaningful variability observed along the typicality–distinctiveness continuum. Highly distinctive faces were more sparsely represented in the database. These norms can support principled stimulus selection and improved methodological control in empirical research with faces.