1. Introduction
Data science has become a critical building block in higher education as industries increasingly rely on it to address complex challenges. This attention creates a new challenge for universities: preparing students for real-world applications, particularly the messy, unstructured datasets that dominate professional environments. In many academic settings, students work with clean, well-structured datasets, which can misrepresent the true nature of the data they will encounter in practice. Real-world data are typically messy—filled with inconsistencies, missing values, and noise—and demand a different set of skills than traditional coursework offers (Schultheis & Kjelvik, 2020). These messy datasets present a unique set of challenges that require students to go beyond basic statistical or algorithmic knowledge and develop practical skills in data cleaning, wrangling, and integration. Without exposure to these challenges, students may struggle to apply theoretical knowledge to practical problems. This gap between the theory taught in class and the practical issues data engineers face with real-world datasets raises the question of how academic settings can better equip students for the challenges of real-world data science. Integrating messy, real-world datasets into academic assignments is therefore critical for preparing students for the unpredictability of real-world data science tasks. Working with such data equips students with essential skills in data wrangling, problem solving, and decision making, ensuring that they are better prepared to extract meaningful insights and add business value when navigating complex, unstructured data in professional contexts (Kjelvik & Schultheis, 2019).
The core research question guiding this study is this: How can the integration of messy, real-world datasets into a data science curriculum support the development of practical data skills and critical thinking? This inquiry examines whether exposing students to complex, incomplete, and unstructured data can mirror the challenges of professional environments and encourage the application of practical solutions to data-driven problems. By incorporating real-world data, the course aims to foster an educational experience that is both authentic and reflective of industry practices.
The educational objectives of this course include fostering students’ problem-solving abilities through the challenge of navigating and managing imperfect data (Donoghue et al., 2021) and strengthening their data wrangling and preprocessing skills, enabling them to effectively clean, transform, and structure data (Clare et al., 2019). Additionally, the curriculum is designed to stimulate critical thinking by prompting students to evaluate and select appropriate supplementary datasets, interpret their results, and assess the business value of the information they extract.
This study examines how an emphasis on practical data preprocessing of real-world data and exploratory data analysis reinforces the critical role of data quality in the overall analytical pipeline. By examining students’ engagement with messy real-world datasets, and reflecting on the outcomes of their project assignments, this study offers insights into how such exposure may contribute to the development of essential data skills. Through this descriptive analysis, this manuscript contributes to data science pedagogy by providing a nuanced account of integrating real-world data complexities into the curriculum.
The methodology used in this study involves designing and implementing a Master’s course that integrates technical and business perspectives through semester-long, hands-on assignments with real-world, messy datasets. These assignments emphasize the practical application of data preprocessing, cleaning, and feature engineering, and are intended to expose students to the challenges inherent in unrefined data. For this purpose, we employed real-world, messy datasets sourced from various dataset repositories (e.g., the UCI Machine Learning dataset repository, Kaggle repository, and national open data repositories) and employed experiential learning approaches to evaluate student engagement and skill development. Additionally, we examined the role of open-source programming and visual tools (e.g., Python, Jupyter Notebooks, KNIME) in fostering practical skills. These tools allow students to experiment with data analysis, modeling, and visualization in a more dynamic and interactive manner. Furthermore, the use of creative pedagogical methodologies, such as working in teams, presenting the results to an audience, and answering domain-specific questions, encourages critical thinking and collaborative problem solving. By fostering creativity and analytical reasoning, this approach not only strengthens students’ data science capabilities but also bridges the gap between theoretical knowledge and its application in real-world scenarios.
By integrating messy datasets and leveraging open-source tools, this study provides insights into bridging the gap between traditional curricula and real-world challenges, equipping students with critical, practical skills for professional success. The study underscores the necessity of developing not just technical knowledge but also practical skills such as the integration of multiple open data sources, tools, and algorithms, and the ability to interpret incomplete information and present the added value from data analysis to the decision makers. This hands-on approach equips students with the adaptability required for tackling real-world data challenges.
Section 2 follows with a literature review on the role of messy, real-world data in data science education, highlighting its relevance to skill development and grounding the discussion in experiential learning theories.
Section 3 outlines the course design, student assignments, dataset selection, and assessment methods, explaining how student feedback shaped course evolution.
Section 4 presents practical insights through case studies of successful assignments, showcasing the skills students developed and the challenges they faced.
Section 5 connects these experiences to theoretical contributions, emphasizing critical thinking and data wrangling as essential skills. Finally, Section 6 discusses the broader pedagogical implications, and Section 7 concludes with a call for integrating real-world data into curricula.
2. Literature Review
Today, there is a growing interest in data-driven approaches in data science education, as real-world data are often messy, incomplete, and continuously evolving.
Hohman et al. (2020) highlight the crucial role of data iteration in machine learning, emphasizing that practitioners frequently improve model performance by iterating on their data rather than their models. Data are constantly evolving, and through various types of iterations, such as adding new data, modifying features, and cleaning data, practitioners refine their models. This iterative process has a direct impact on model performance, demonstrating the importance of incorporating evolving and messy data into data science education to prepare future data scientists for the challenges of real-world applications. The same claim is supported by Jarrahi et al. (2023), who emphasize that the quality of data is fundamental to the performance of AI models. In the data-centric approach, the focus shifts from merely refining algorithms to enhancing the quality and structure of the data themselves. This involves various preprocessing techniques such as data cleaning, transformation, and augmentation to ensure that the data are representative and useful for training models.
Despite the increasing demands of data-driven approaches that consume real-world data, tertiary data science education still relies on preprocessed datasets for teaching basic theoretical concepts. As stated by Yan and Davis (2019), “In many academic courses students do exercises and practice on toy datasets to complete homework exercises and study for an examination to get a satisfactory grade”. Such datasets, often clean and well structured, can lead students to develop unrealistic expectations about data analysis in professional settings. Studies indicate that engaging with incomplete, noisy, or outlier-ridden datasets enhances students’ abilities in data cleaning, error handling, and critical thinking. For instance, students learn to navigate ambiguities and uncertainties inherent in real-world data, which fosters resilience and adaptability—key traits for future data scientists. Moreover, the ability to work with messy data prepares students to address practical challenges they will encounter in industry roles, where datasets rarely come preprocessed. By grappling with issues such as missing values and data inconsistencies, students develop a deeper understanding of the data life cycle, from collection to analysis, and improve their critical thinking through real-world paradigms. In the context of data science, critical thinking refers to the ability to navigate the complexities of messy, real-world datasets by identifying patterns, resolving ambiguities, and adapting analytical methods to varying scenarios. This process enables students to combine theoretical knowledge with practical problem-solving skills, preparing them to extract actionable insights and address real-world data challenges effectively. This is the reason why Yan and Davis (2019), Hicks and Irizarry (2018), and others agree on the delivery of introductory data science courses centered entirely on case studies, and provide examples of real-world datasets and topics that can be taught using them. In the following paragraphs, we focus on some works that highlight the importance of using incomplete and messy data in data science education.
The early work of Baumer (2015) was chosen due to its innovative integration of practical data science tasks within the curriculum. It describes an undergraduate course designed to equip students with essential data science skills, emphasizing practical applications across the entire data analysis spectrum, from data acquisition to visualization and communication. This course presents a holistic approach that integrates statistics and computer science, allowing for flexibility in curriculum design while addressing modern data challenges.
Rosenthal and Chung (2020) proposed a whole curriculum that integrates mathematics, information systems, and data science courses to enhance student success and retention through active learning and skill-building assignments. Their Introduction to Data Science course guides students through the data science process using practical projects on real, messy datasets and tools, fostering confidence and competence in tackling real-world data challenges. This article exemplifies innovative educational practices that address the growing demand for data literacy in today’s job market, making it highly relevant for educators and policymakers. However, while the curriculum is comprehensive, it may lack sufficient emphasis on interdisciplinary collaboration among students from diverse academic backgrounds, which could further enrich the learning experience and better prepare them for real-world applications.
The effects of the use of real data in teaching data science and statistics concepts have been outlined in the work of Lau et al. (2022), who reflect on their experiences launching a data science program at a large U.S. university and demonstrate their best practices. Their report presents three case studies highlighting the differing approaches of computer science and statistics instructors, ultimately suggesting ways to better integrate both perspectives to enhance data science education. One of the cases focuses on handling real-world data and highlights the current explorative trend among data scientists to start with readily available data and explore intriguing questions that the data may reveal. Despite the valuable case studies it provides, this article could benefit from a more extensive discussion on the scalability of these pedagogical strategies across different educational contexts, as well as how they might be adapted for diverse student populations.
Theoretical contributions to data science education draw heavily from established educational frameworks like Kolb’s Experiential Learning Theory (ELT), which emphasizes the importance of learning through concrete experiences, reflective observation, abstract conceptualization, and active experimentation. In data science education, this framework is particularly valuable as students engage with messy, real-world data, requiring them to iterate between theory and hands-on practice (Allen, 2021). The importance of experiential learning in statistics courses has been discussed in (Taback, 2018), where data science competitions are highlighted as a tool for applying theory to practice and a reason to work with real-world data. These competitions, through frameworks like the Common Task Framework (CTF), offer students hands-on experience with real-world data and predictive modeling, fostering problem-solving skills and practical application of statistical concepts.
Kolb’s Experiential Learning Theory (ELT) provides a foundational framework for this study, as it emphasizes the iterative cycle of concrete experience, reflective observation, abstract conceptualization, and active experimentation. This study aligns closely with the ELT framework by engaging students with messy data through hands-on tasks that challenge their existing knowledge and push them toward developing new, adaptive strategies for data wrangling and analysis.
Another related framework, problem-based learning (PBL), complements ELT by encouraging students to approach complex, ill-defined problems. In this way, it mirrors the nature of data science tasks, where students must identify patterns, clean data, and generate insights from raw information. The approach has been adopted by several researchers in an attempt to foster deeper engagement with real-world problems compared with traditional methods like MCQ-based assessments (Tanna et al., 2022), or has been combined with learning analytics to redesign existing courses and make them more data driven and student centered (Zotou et al., 2020). In our approach, PBL is incorporated to ensure that students move beyond theory and build practical problem-solving skills by directly addressing the ambiguities inherent in messy datasets.
Such theoretical frameworks foster critical thinking and analytical skills by pushing students to grapple with the uncertainty and variability of real datasets, rather than the neat, preprocessed data often used in traditional education. By engaging with messy data, students develop a deeper understanding of the data analysis process and learn to adapt their techniques to varying contexts, ultimately preparing them for the complex challenges they will face in professional environments.
Real-world datasets, such as those provided in data contests like those hosted by Kaggle, offer invaluable opportunities for developing critical skills in data science. These datasets, often sourced directly from companies, present messy, unstructured data that challenge participants to engage in data wrangling, preprocessing, and error handling, and challenge the scalability of algorithms, which is crucial for handling large datasets in practice. Studies have shown that working with such datasets fosters resilience and adaptability, as students must learn to navigate incomplete, noisy, and non-ideal data conditions. This exposure not only enhances problem-solving and technical skills but also prepares students to design scalable, efficient algorithms that can tackle complex, real-world challenges in data science.
Lwakatare et al. (2020) identified key challenges in the development and maintenance of large-scale ML-based systems, particularly related to adaptability, scalability, privacy, and safety. They highlighted issues such as unstable data dependencies, noisy data, cold-start problems, and scaling challenges with model size and data while noting that privacy and safety solutions were the least addressed in the literature, pointing to significant areas for future research. Such problems are frequently evident in real-world data, and thus, the use of such data in the education process prepares students with all the necessary skills. This article was selected because it provides a comprehensive overview of the critical challenges faced by practitioners in the field of machine learning, making it a valuable resource for educators aiming to align their curricula with industry needs. However, while it effectively outlines various challenges, it could further enhance its impact by offering more detailed case studies or examples of successful implementations that have navigated these challenges, thereby providing actionable insights for educators and practitioners alike.
Finally, the more recent work of Zha et al. (2023) underscores the critical importance of data processing in the training of AI models, particularly as real-world datasets often contain issues such as errors in labels, imbalances, missing cases, and inherent biases. Key stages in the data construction process—such as labeling, augmentation, reduction, and preparation—are essential for addressing these challenges. For instance, while manual labeling methods are accurate, they are costly, prompting the need for more efficient strategies like semi-supervised and weakly supervised learning. Additionally, data augmentation techniques are crucial to mitigate data imbalance by creating diverse samples, and data reduction methods help optimize performance by simplifying large datasets. As datasets continue to grow in size and complexity, these data processing techniques will be increasingly important to ensure that AI models can effectively learn from real-world data while maintaining high performance and fairness.
3. Methodology
The Master’s course “Data Mining and Business Intelligence” is carefully structured to foster both technical and business skills, attracting students from computer science/IT backgrounds as well as those with expertise in technology management. The student population comprises both younger students who have just graduated from computer science or IT management departments and students aged around 30 or above who have an IT background and work experience and are interested in upgrading their skills and updating their knowledge with recent advances in computer science and information technology. The majority (>80%) of the students are male. Most of the students have a degree in computer science or IT (>70%), whereas the rest have a first degree in Physics, Maths, Economics, or Management Sciences. This dual expertise enriches the learning environment, allowing for a dynamic approach to problem solving where technical rigor meets business acumen. Computer science graduates typically bring proficiency in programming and data management, which equips them to handle complex, unstructured datasets. In contrast, technology management graduates contribute a strong understanding of business contexts, helping to interpret data insights in ways that align with real-world decision-making needs.
Figure 1 illustrates the course structure, which is organized into two sequential strands: Data Mining and Data Analysis. The Data Analysis strand focuses on setting up a data analysis pipeline for extracting business intelligence, whereas the Data Mining strand emphasizes the ad hoc processing of data to uncover meaningful patterns. Although they appear parallel, the two strands are executed sequentially as the course progresses. Initially, students are introduced to the basics of OLAP and data analytics to build foundational knowledge. This is followed by the second strand, which performs a deeper exploration of data mining methodologies and allows students to apply advanced algorithms in practice. Both strands involve three key activities: data processing, data transformation, and presentation of results (which includes visualization and interpretation), thus covering the typical steps of the data mining process.
Throughout the semester, students work on an integrative assignment that brings together tasks from both strands. First, students develop a data analysis pipeline addressing key questions of a hypothetical decision maker. This process helps students experience the challenges of data cleaning, feature engineering, and data interpretation while demonstrating how raw data can be transformed into meaningful recommendations. Semester assignments play a key role in enabling students to apply theoretical knowledge in practical, hands-on ways, especially with messy, open-access datasets often drawn from data contests or open data repositories like data.gov.uk. Moreover, these assignments often reflect real-world scenarios, allowing students to engage with data that mirror the types of situations they may encounter in their professional lives. This connection to actual contexts enhances the learning experience by providing relevance and situational understanding. The inclusion of contextual relevance in the datasets encourages situated learning (Lave & Wenger, 1991), where students not only apply technical skills but also develop a deeper understanding of how data analysis is used to address specific challenges in diverse professional fields. Second, students undertake a data mining task, such as classification, regression, or clustering, which requires them to preprocess data and adapt them to algorithmic requirements. This practice is essential for understanding how critical preprocessing steps like normalization, handling missing values, and feature selection can enhance model quality and reliability.
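To make these preprocessing steps concrete, the following minimal Python sketch (using pandas, with entirely hypothetical column names and values rather than any actual course dataset) illustrates median/mode imputation followed by min-max normalization, the kind of cleaning students typically perform before modeling:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of a messy open dataset: missing values and mixed scales.
df = pd.DataFrame({
    "speed_limit": [30, 30, np.nan, 50, 60, np.nan, 30, 50],
    "casualties":  [1, 2, 1, np.nan, 3, 1, 2, 1],
    "road_type":   ["urban", "urban", "rural", None, "rural", "urban", "urban", "rural"],
})

# Impute numeric gaps with the median and categorical gaps with the mode.
for col in ["speed_limit", "casualties"]:
    df[col] = df[col].fillna(df[col].median())
df["road_type"] = df["road_type"].fillna(df["road_type"].mode()[0])

# Min-max normalization brings features on different scales into [0, 1].
for col in ["speed_limit", "casualties"]:
    lo, hi = df[col].min(), df[col].max()
    df[col] = (df[col] - lo) / (hi - lo)

print(df)
```

The choice of median and mode here is only one defensible strategy; part of the learning experience is justifying such choices against the characteristics of the dataset at hand.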
Finally, students are required to integrate their findings, compare results, and present their outcomes in a final comprehensive report and presentation. This approach ensures that students gain both theoretical knowledge and practical skills, fostering their ability to tackle real-world data challenges effectively.
By working with unrefined, real-world datasets, students develop resilience and adaptability, skills that are essential in real-world data science roles. They gain experience in navigating the inconsistencies and imperfections common in large-scale datasets, learning how to address issues such as missing data and outliers. Additionally, through hands-on data mining tasks, they see firsthand how various preprocessing techniques can impact the accuracy and robustness of their models, reinforcing the importance of data quality in the analysis pipeline.
For student assignments, datasets are selected based on their suitability for real-world data mining tasks, their accessibility as open data, and the richness of attributes that define diverse analytical dimensions. The selection of a messy, real-world dataset is a key point of this course. The datasets are selected to be multi-dimensional; to depict a real-world problem (e.g., road safety, real estate, criminality, or financial and societal fluctuations); and to be “imperfect” and potentially expandable with external datasets. Examples include open-access datasets from government agencies, like transportation and census data, and public health repositories, which often come in multiple files spanning different time periods and varying completeness levels. The datasets are chosen not for size but for their complexity and potential to support valuable, actionable insights for a hypothetical decision maker. Students are encouraged to refine and enhance data quality by integrating external datasets, adjusting granularity, and applying transformation techniques, which allows them to maximize dataset utility and practice comprehensive data preprocessing.
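As an illustration of this kind of enrichment, the sketch below (with invented district and income figures, not drawn from any course dataset) aggregates fine-grained records to a coarser granularity and then joins an external, district-level source using pandas:

```python
import pandas as pd

# Hypothetical incident-level records (fine granularity).
incidents = pd.DataFrame({
    "district": ["A", "A", "B", "B", "B", "C"],
    "year":     [2020, 2021, 2020, 2020, 2021, 2021],
})

# Hypothetical external dataset at district granularity (e.g., census income).
income = pd.DataFrame({
    "district": ["A", "B", "C"],
    "median_income": [42000, 35000, 51000],
})

# Aggregate incidents up to district/year, then enrich with the external source.
counts = (incidents.groupby(["district", "year"])
                   .size().reset_index(name="n_incidents"))
enriched = counts.merge(income, on="district", how="left",
                        validate="many_to_one")  # guard against a bad join key

print(enriched)
```

The `validate` argument is a useful safety net when students combine open datasets of uncertain quality, as it fails fast if the external table unexpectedly contains duplicate keys.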
The assignment is designed to integrate both data analysis and data mining skills, requiring students to apply structured data processing while critically evaluating the value of the patterns they extract. Additionally, it encourages an adaptive approach to data processing, where students select and apply various data mining techniques—such as classification, clustering, and regression—based on the specific characteristics of the dataset and the problem at hand. The course’s evaluation process emphasizes both technical and analytical skills, with a strong focus on each team member’s unique contribution—whether technical or business oriented. Students are assessed on their ability to extract valuable insights from messy data, evaluating their problem-solving skills, creativity, and effective use of data cleaning and transformation techniques. For the data analysis component, students are graded on the meaningful conclusions they derive for decision making, while for data mining, the focus is on how effectively they address data quality issues to enhance model performance. This dual approach teaches students to appreciate and leverage data quality improvements, highlighting the real-world impact that handling messy data can have on outcomes.
At the end of each semester, students provide feedback on the course through a standardized evaluation process, which includes several questions that rate the course execution and a few open-ended questions, allowing them to reflect on the course structure, project execution, and their learning experience. Key challenges raised consistently relate to limited access to domain-specific external datasets that could significantly improve the dataset value, varying completeness, feature availability across years, and value incompatibility among datasets from different time periods. Additionally, some students have noted that larger dataset sizes made it challenging to complete certain data warehousing tasks within reasonable timeframes. These insights have informed course adaptations, such as pre-selecting datasets and pointing to external resources that could be employed to ensure greater consistency, facilitating early access to compatible external data, and offering guidance on handling larger datasets more efficiently to enhance their learning experience.
Over the past 15 years, we have consistently employed this methodology to ensure that students are equipped to tackle real-world data challenges. More than 150 group assignments have been completed and presented by students, helping them acquire valuable skills in analyzing data and presenting business insights in a clear and comprehensive manner. Additionally, more than 10 students have chosen to pursue similar projects for their diploma theses, working on various real-world data analysis tasks, often utilizing data from their professional environments. The findings presented here draw on patterns observed over all these years and reflect diverse approaches to solving real-world data challenges.
While this work does not employ a formal research methodology, insights are derived from iterative reflections on assignment outcomes, systematic feedback from students, and continuous course refinement over several years. These processes implicitly shape the findings and provide a pathway for employing messy data in data education. This pathway can be summarized by the following actions:
State the objective and rationale behind the use of messy data: The primary objective is to introduce students to real-world, messy datasets and equip them with the necessary skills to handle, clean, analyze, and interpret such data. This is essential for fostering data literacy, preparing students for data science roles, and bridging the gap between technical knowledge and business acumen. The rationale lies in the importance of working with imperfect data, where students learn to extract actionable insights from noisy, incomplete, and inconsistent datasets.
Set dataset selection criteria: Datasets are selected based on their real-world relevance and inherent messiness, not size. The criteria include complexity (e.g., missing data, inconsistencies), domain diversity (e.g., transportation, health, business), and the ability to support meaningful analysis. Datasets must challenge students to address issues such as temporal inconsistencies, missing features, and data noise, encouraging them to refine data quality and practice data preprocessing techniques.
Design the course assignments: Assignments are designed to guide students through the full data analysis pipeline. Students engage in tasks like data cleaning, feature engineering, and model development, which are structured around real-world problem-solving scenarios. The assignments allow students to apply data mining techniques (e.g., classification, regression, clustering, affinity analysis) to messy data, with a focus on transforming raw data into valuable insights for decision making. This is a typical scenario in modern real-world data analytics, where we have the data and we must alternatively apply multiple techniques in order to extract value.
Teach data preprocessing and cleaning techniques: A core aspect of the methodology is the emphasis on data preprocessing. Students are taught how to handle common issues in messy datasets, such as missing values, outliers, and noise. Techniques like normalization, feature selection, and handling imbalanced datasets are central to improving model quality and ensuring robust analyses.
Assess both technical and analytical skills: Assessments focus on both the technical execution (data preprocessing, model implementation) and the analytical outcomes (quality of insights derived, decision-making relevance). Students are evaluated on how effectively they handle messy data, the accuracy of their models, and how well their insights address business problems.
Promote active learning and teamwork: Students learn by actively engaging with datasets, iterating on their work, and collaborating with teammates. By combining technical and business skills, students simulate real-world data science environments, where they practice problem solving and teamwork. This promotes deeper understanding and equips students with both technical and analytical expertise.
Incorporate continuous feedback and iterative improvement: Regular feedback is provided on both technical and analytical aspects, encouraging students to revise and improve their work. This iterative approach helps students refine their skills, address challenges faced during the assignments, and grow in their ability to work with messy data over time. Presenting project results to peers fosters a collaborative learning environment, allowing students to exchange ideas, receive diverse perspectives, and gain valuable insights into different approaches to solving problems.
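As one concrete instance of the preprocessing techniques named in these actions, the following sketch shows a naive random-oversampling strategy for an imbalanced label column. The data are synthetic, and this is only one of several rebalancing approaches students might justify and apply:

```python
import pandas as pd

# Hypothetical imbalanced labels, as in an arrest-prediction task.
df = pd.DataFrame({"feature": range(10),
                   "label": [0] * 8 + [1] * 2})

# Naive random oversampling: duplicate minority-class rows until classes match.
counts = df["label"].value_counts()
minority = counts.idxmin()
n_extra = counts.max() - counts.min()
extra = df[df["label"] == minority].sample(n_extra, replace=True, random_state=0)
balanced = pd.concat([df, extra], ignore_index=True)

print(balanced["label"].value_counts())
```

In class, this simple baseline is a starting point for discussing its pitfalls (duplicated rows inflate apparent confidence) and for comparing it with alternatives such as undersampling or synthetic sample generation.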
4. Practical Insights and Case Studies
Over the years of the Master’s course implementation, students have been assigned complex data analysis tasks to gain practical skills in managing real-world datasets. These assignments typically employ open-access datasets with incomplete or inconsistent information, requiring students first to improve data quality through data cleansing and enhancement and then to implement both analytical and data mining techniques, such as visualization, clustering, and classification. Through these assignments, students work to create a valuable data pipeline, develop insights for hypothetical decision makers, and employ data transformation techniques to manage dataset quality.
The methodology presented in the previous section consistently underpins all the case studies described in this section, providing students with a structured approach to tackling complex data analysis tasks. Each assignment begins by clearly stating the objective and rationale for using messy, real-world datasets, emphasizing the importance of preparing students for practical challenges in data science. The datasets are carefully selected based on their relevance, complexity, and diversity, ensuring that they offer opportunities for meaningful analysis despite inherent inconsistencies. Assignments are designed to guide students through the complete data analysis pipeline, from data cleaning and preprocessing to feature engineering and model development, with a focus on applying diverse analytical techniques to derive actionable insights. A strong emphasis is placed on teaching data preprocessing methods, such as handling missing values, normalization, and feature selection, to improve model robustness. Assessments evaluate both technical execution and the quality of analytical insights, ensuring that students develop the skills needed to address real-world data challenges effectively.
In one successful assignment, students worked with Chicago’s publicly available crime dataset [1], which includes incident reports from 2001 onward. Here, students used data visualization tools and clustering techniques to identify patterns in crime data for specific timeframes and locations. They also developed a classifier predicting whether an arrest would follow a given incident, tackling data quality challenges by preprocessing and enhancing the dataset with additional geographic and financial data, such as neighborhood income levels. This added context enabled students to generate insights into crime patterns, such as how socio-economic factors influence crime in different areas. Tools like KNIME and Python were commonly used for preprocessing, and some students applied Oracle’s Analytic Workspace Manager for data warehousing and analytics tasks.
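The enrichment step described above, joining incident records with an external socio-economic table, can be sketched in a few lines of pandas. The column names and income figures below are illustrative placeholders, not the real schema of the Chicago dataset:

```python
import pandas as pd

# Hypothetical mini-sample of incident records; the real dataset carries
# many more columns (crime type, arrest flag, community area, etc.).
incidents = pd.DataFrame({
    "community_area": [8, 8, 32, 32, 76],
    "primary_type": ["THEFT", "BATTERY", "THEFT", "NARCOTICS", "THEFT"],
    "arrest": [False, True, False, True, False],
})

# External enrichment table (made-up income figures for illustration).
income = pd.DataFrame({
    "community_area": [8, 32, 76],
    "median_income": [35000, 52000, 61000],
})

# A left join keeps every incident even if an area lacks income data.
enriched = incidents.merge(income, on="community_area", how="left")

# A first socio-economic cut: arrest rate per income bracket.
enriched["income_band"] = pd.cut(enriched["median_income"],
                                 bins=[0, 40000, 60000, 1e9],
                                 labels=["low", "mid", "high"])
arrest_rate = enriched.groupby("income_band", observed=True)["arrest"].mean()
print(arrest_rate)
```

Choosing a left join here is itself a data quality decision: incidents from areas missing in the income table survive with a NaN rather than disappearing silently.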
Another assignment required students to analyze Airbnb rental data [2], with a focus on addressing real-world challenges such as missing values and variations in data across countries due to national regulations. Students began by formulating key questions to guide their analysis, including identifying the most and least expensive cities in certain countries, examining the relationship between rental price and the number of reviews per month, and understanding how accommodation category (e.g., entire home, shared room) influences pricing within cities. By calculating and visualizing metrics such as the average accommodation price per city and correlating these with guest reviews, students provided insights that could help professional real estate investors make informed decisions on property investments. They worked with geo-tagging tools and external data sources to capture spatial information, such as neighborhood boundaries, using tools like geojson.io for mapping and additional geospatial resources from KNIME’s extensions. This hands-on experience helped students confront common challenges in data wrangling, apply clustering and predictive analytics, and integrate multiple data sources to enrich the results of their analysis, ultimately leading to actionable insights for hypothetical investors on where high demand aligns with pricing, as well as areas where specific accommodation types might yield the best returns.
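The per-city averages and price–review correlation mentioned above reduce to standard group-by operations. The following sketch uses a tiny fabricated listings table (real Airbnb exports carry far more columns and country-specific gaps that must first be cleaned):

```python
import pandas as pd

# Illustrative listings only; values are invented for the example.
listings = pd.DataFrame({
    "city": ["Athens", "Athens", "Berlin", "Berlin", "Berlin"],
    "room_type": ["Entire home", "Shared room", "Entire home",
                  "Entire home", "Shared room"],
    "price": [80.0, 25.0, 120.0, 110.0, 40.0],
    "reviews_per_month": [2.1, 0.8, 3.5, 2.9, 1.0],
})

# Average accommodation price per city ...
avg_price = listings.groupby("city")["price"].mean()

# ... and broken down further by accommodation category.
by_type = listings.groupby(["city", "room_type"])["price"].mean()

# Does demand (reviews per month) move with price?
corr = listings["price"].corr(listings["reviews_per_month"])
print(avg_price, by_type, corr, sep="\n")
```

On real data, the same three lines surface exactly the investor-facing questions posed in the assignment: which cities command the highest average prices, and whether review activity tracks price.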
In another assignment, students were provided with datasets related to the economic performance of countries in the European Union, sourced from the World Bank [3]. The datasets included data on GDP, inflation, unemployment rates, and public spending across multiple years. The students were tasked with aligning and merging the datasets from various countries and analyzing specific indicators over time. They were asked to explore trends in economic growth, comparing Northern and Southern European countries or focusing on specific countries of interest. The analysis required not only statistical insights but also an understanding of the underlying economic policies that might influence these indicators. The key challenges students faced involved handling discrepancies in data format and completeness, addressing missing values (for example, dealing with missing GDP data for certain years), and choosing appropriate methods for comparing countries with significantly different economic structures. Beyond typical data preparation tasks such as cleaning and normalizing the data, students were encouraged to consider the broader economic context of each country, such as the impact of the financial crisis or EU policy changes. This deeper analysis was crucial for making informed comparisons and drawing meaningful conclusions from the data, encouraging students to develop both technical and contextual critical thinking skills in their approach to data science.
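The alignment and missing-GDP problems described above can be illustrated with a minimal pandas sketch. The country names are real, but the growth figures below are invented for demonstration; linear interpolation is only one of several defensible imputation choices:

```python
import pandas as pd

# Two illustrative World Bank-style extracts with different coverage;
# the Greek GDP growth figure for 2011 is deliberately missing.
north = pd.DataFrame({"year": [2010, 2011, 2012],
                      "country": "Germany",
                      "gdp_growth": [4.2, 3.9, 0.4]})
south = pd.DataFrame({"year": [2010, 2011, 2012],
                      "country": "Greece",
                      "gdp_growth": [-5.5, None, -7.1]})

# Align the per-country extracts into one panel.
panel = pd.concat([north, south], ignore_index=True)

# Reshape to one column per country, then fill the gap linearly --
# a documented, deliberate choice rather than silently dropping the year.
wide = panel.pivot(index="year", columns="country", values="gdp_growth")
wide["Greece"] = wide["Greece"].interpolate()
print(wide)
```

Making the imputation step explicit in the pipeline is precisely the kind of judgment call the assignment asks students to defend when comparing structurally different economies.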
Through these assignments, students developed essential data handling skills, including data cleaning, preprocessing, and outlier detection, while also learning to navigate the complexities and ambiguities inherent in real-world data. By working in teams, they were encouraged to actively engage with the datasets, iterate on their approaches, and incorporate the diverse perspectives of their peers, mimicking the collaborative nature of professional data science environments. Students also gained hands-on experience with crucial data transformation techniques, enhancing their technical proficiency in preparation for advanced analytical tasks. Activities included filtering columns for accuracy, removing duplicates, and handling missing values, all of which emphasized the importance of high data integrity. Students applied transformations on columns like “price” to remove special characters and converted categorical variables to sparse matrices for efficient processing, giving them a deeper understanding of data structure optimization. They also learned discretization techniques to transform values into value ranges and prepared datasets for machine learning by splitting them into training and testing sets to evaluate model performance. Throughout the process, continuous feedback was provided, helping students refine their techniques and build confidence in managing messy datasets. Finally, they tuned their classifier parameters, sharpening their skills in model optimization and evaluation. These tasks collectively equipped students with practical expertise in data preparation and modeling—key capabilities for real-world data analysis and machine learning projects.
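The transformation tasks listed above (stripping special characters from a “price” column, encoding categoricals, discretizing values into ranges, and splitting data for evaluation) can be sketched end to end. Column names and values are illustrative; a deterministic manual split stands in for a library splitter:

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["$1,200", "$850", "$2,300", "$640", "$1,975", "$410"],
    "room_type": ["entire", "shared", "entire", "shared", "entire", "shared"],
})

# Strip currency symbols and thousands separators, then cast to numeric.
df["price"] = (df["price"].str.replace(r"[$,]", "", regex=True)
                          .astype(float))

# Encode the categorical column as indicator columns
# (pd.get_dummies also accepts sparse=True for memory-efficient output).
dummies = pd.get_dummies(df["room_type"], prefix="room")

# Discretize price into labelled value ranges.
df["price_band"] = pd.cut(df["price"], bins=[0, 800, 1600, 10000],
                          labels=["low", "mid", "high"])

# A simple reproducible train/test split for model evaluation.
shuffled = df.sample(frac=1.0, random_state=42)
cut = int(len(shuffled) * 0.67)
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]
print(train.shape, test.shape)
```

Fixing the random seed makes the split reproducible, which matters when students iterate on their pipelines and compare classifier parameters across runs.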
The open-ended nature of the assignments encouraged students to choose tools that best suited their skill sets, whether code-based (e.g., Python) or visual platforms (e.g., KNIME), allowing them to explore and showcase their strengths across various domains (e.g., see
Figure 2 for an example of an external tool used for visualization). Mixed teams of business-oriented and computer science students provided a collaborative environment where technical skills and business insights complemented each other, with a strong emphasis on delivering actionable insights for decision makers. Regular discussions on their findings in the class served as an integral part of the learning process, fostering active learning and iterative improvement. These brief sessions and a final presentation of the findings to the whole class encouraged students to analyze constructive feedback from their peers and instructors, enabling them to refine their approaches and better communicate their results. In these discussions, students practiced abstract conceptualization by synthesizing feedback, identifying patterns in their data processing workflows, and applying those insights to improve their analysis. This approach not only strengthened technical proficiency but also underscored the importance of creating analyses that hold real-world value, sharpening students’ abilities to communicate results that can directly inform strategic decisions. By abstracting lessons from diverse assignments, such as aligning international datasets or correlating crime patterns with socio-economic factors, students developed transferable problem-solving skills essential for professional data analytics scenarios. Ultimately, the structured methodology promoted problem solving and teamwork, ensuring that students were well prepared for challenges they might face in professional data analytics scenarios.
4.1. Open Data Repositories
Open data repositories are invaluable resources for educational courses in data science, offering access to real-world but often messy datasets. These repositories enable students to work with data that reflect the challenges of missing information, inconsistencies, and noise, providing hands-on experience in data wrangling, cleaning, and analysis. Such repositories range from academic platforms like the Machine Learning Repository of the University of California at Irvine, which hosts datasets curated for teaching and research purposes, to governmental open data portals like data.gov, data.gov.uk, and data.gov.gr, which provide publicly accessible data on various domains such as healthcare, transportation, and public safety. Commercial repositories like Kaggle, Yahoo Finance, and CoinGecko focus on business and financial data, allowing students to analyze trends and generate actionable insights. These diverse resources expose students to a variety of data types, fostering critical thinking and problem-solving skills that are essential for real-world data science applications.
Table 1 provides a summary of the data repositories that we examined over the years. Although the list is not exhaustive, it can be useful for other educators who wish to include real-world data in their courses.
4.2. Data Processing Tools
Data mining, preprocessing, and visualization tools are fundamental components of data science workflows, enabling practitioners to transform raw data into actionable insights. These tools support tasks such as cleaning messy data, exploring patterns, and communicating findings effectively. Graphical tools like KNIME, Orange Data Mining, and RapidMiner provide user-friendly interfaces for implementing algorithms and workflows, catering to users with varying levels of technical expertise. Visualization tools such as Tableau and QGIS enhance understanding through graphical representations of data, while specialized data cleaning tools like OpenRefine and Data Wrangler focus on preparing datasets for analysis by addressing inconsistencies, missing values, and formatting issues. The variety of available tools ensures that educators and learners can tailor their workflows to their technical needs and project requirements.
Table 2 provides an overview of tools used in data science education and practice, emphasizing their versatility in supporting various stages of the data science process. These resources highlight the breadth of tools available to educators, researchers, and professionals, enabling them to address the multifaceted challenges of working with real-world data.
5. Theoretical Contributions to Data Science Education
The “Data Mining and Business Intelligence” course and the methodology it follows emphasize several critical factors that enhance students’ learning experiences and prepare them for real-world challenges in the field. By applying experiential learning principles, students engage with messy datasets that reflect the complexities of actual data scenarios, fostering a deeper understanding of data analysis beyond theoretical knowledge. The incorporation of messy data not only promotes critical thinking and essential data-wrangling skills but also emphasizes the need for these competencies in preparing future data scientists. Additionally, the course bridges the gap between academic training and industry expectations by exposing students to the practical realities of working with imperfect datasets. This comprehensive approach equips students with the necessary problem-solving skills and insights required to make meaningful contributions to their future careers in data science. More details on these important factors are provided in the subsections that follow.
5.1. Application of Experiential Learning
The course applied experiential learning principles by immersing students in hands-on assignments that involved active problem solving and decision making with real-world, messy datasets. Grounded in Kolb’s experiential learning theory, which suggests that deeper understanding arises when learners engage in tasks resembling real-life challenges (Kolb, 2014), the course guided students through all four stages of experiential learning. Students gained concrete experience by addressing tasks like identifying and filling in missing data, transforming categorical and numerical variables, and creating meaningful visualizations. These practical experiences fostered reflective observation as students analyzed their approaches, examining successes and setbacks to refine their techniques. Through abstract conceptualization, students linked hands-on tasks to data science theories, developing a strong foundation in core competencies like data wrangling and analysis. Finally, the open-ended nature of these assignments encouraged active experimentation, pushing students to continuously adapt and apply their knowledge to new situations. This approach fostered a dynamic learning cycle that developed practical, enduring data science skills while instilling critical problem-solving abilities.
In extending the experiential learning methodology to real-time data contexts, one of the previous course assignments asked students to analyze and visualize open-access financial data from the CoinGecko (
https://www.coingecko.com/api/documentations/v3—accessed on 10 April 2025) cryptocurrency API. This task provided students with concrete experiences involving live, volatile data, which required immediate adaptation as they worked with fluctuating and unpredictable values. Unlike static datasets, the cryptocurrency prices changed by the second, compelling students to reflect on the implications of capturing and interpreting data snapshots within a dynamic market.
Through iterative cycles of data analysis, students experienced Kolb’s learning stages—concrete experience, reflective observation, abstract conceptualization, and active experimentation—in real time. During the concrete experience stage, students actively engaged with the data, exploring different queries and API endpoints, allowing them to form a tangible connection to the real-time fluctuations they were analyzing. They observed the challenges and inconsistencies in the live data and reflected on how various factors like market events or API latency affected the data, considering the implications of those observations on their decision making. Students examined their thought processes and results, evaluating which strategies for data cleaning and visualization worked best and which areas required improvement. The abstract conceptualization (AC) stage followed as students began connecting their observations to broader concepts in data science. They linked the unpredictable nature of cryptocurrency prices to theoretical ideas of data uncertainty, market efficiency, and volatility. Students synthesized their hands-on experience with the theoretical concepts of data analysis they learned in class, such as statistical modeling and data forecasting, using these to guide their approach to handling the live data. This helped them build a more comprehensive understanding of how theoretical frameworks could be applied to dynamic, real-time contexts.
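A minimal sketch of this kind of API work follows. It assumes the shape of CoinGecko’s /simple/price endpoint (a JSON object mapping coin IDs to per-currency quotes); the sample figures are invented, and separating the parsing logic from the network call lets snapshots be replayed and tested offline, which is exactly the “snapshot of a moving market” issue students had to reflect on:

```python
import json
from urllib.request import urlopen

# Assumed endpoint shape; see CoinGecko's /simple/price documentation.
API_URL = ("https://api.coingecko.com/api/v3/simple/price"
           "?ids=bitcoin,ethereum&vs_currencies=usd")

def latest_prices(payload: dict) -> dict:
    """Flatten a /simple/price-style payload into {coin: usd_price}."""
    return {coin: quote["usd"] for coin, quote in payload.items()}

# Replay a captured snapshot (figures are made up for illustration).
sample = json.loads('{"bitcoin": {"usd": 64250.0}, '
                    '"ethereum": {"usd": 3150.5}}')
print(latest_prices(sample))

FETCH_LIVE = False  # set to True to take a live snapshot instead
if FETCH_LIVE:
    with urlopen(API_URL, timeout=10) as resp:
        print(latest_prices(json.load(resp)))
```

Keeping live fetching behind a flag mirrors good practice for volatile sources: analyses run against timestamped snapshots remain reproducible even as the market moves.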
As a final project, one group developed an application (see Figure 3) that visualized real-time cryptocurrency price fluctuations, offering users comparative insights into trends across multiple currencies. By implementing the insights from their reflections and conceptualizations, students tested their models in a live environment, experimenting with different ways to present and analyze data to ensure accuracy and usability for the end users. This active experimentation with live data not only enhanced students’ technical skills in API integration and real-time data processing but also deepened their understanding of the responsiveness and agility required in fast-paced data environments.
5.2. Critical Thinking and Data Wrangling as Core Competencies
Introducing messy data into the curriculum encourages critical thinking, which is crucial in data science, as it prompts students to question assumptions, validate findings, and make informed decisions on data treatment. For example, students must critically evaluate the validity of outliers in a dataset, deciding whether they represent true anomalies or are artifacts of poor data collection methods. Real-world datasets, unlike clean academic datasets, require students to navigate ambiguity and make judgment calls that impact the integrity and applicability of their analyses. Students are challenged to determine the appropriate methods for handling conflicting data entries or missing values, weighing the trade-offs between deleting problematic entries and using imputation techniques. Data wrangling—the process of cleaning, structuring, and enriching raw data into a usable format—thus becomes more than a technical skill; it becomes a vital competency for data scientists who must constantly assess not only the accuracy but also the representativeness of the data they work with. Through tasks such as filtering, deduplication, transformation, and handling of missing values, students in this course gained a hands-on understanding of data wrangling, reinforcing its importance as a core competency for anyone entering the field of data science. Critical thinking in this context also involves reflecting on how data transformations might affect the interpretation of results, such as how scaling data or encoding variables may alter the outcome of machine learning models.
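The deletion-versus-imputation trade-off discussed above can be made concrete in a few lines. The values below are fabricated; the point is that each strategy changes both the sample size and the resulting statistics, which is precisely the judgment call students must defend:

```python
import pandas as pd

# A toy column with missing entries (values invented for illustration).
ratings = pd.Series([4.0, None, 3.5, 5.0, None, 1.0])

# Option 1: delete incomplete entries -- simple, but shrinks the sample.
dropped = ratings.dropna()

# Option 2: impute with the median -- keeps the sample size, but
# manufactures values and dampens the variance of the column.
imputed = ratings.fillna(ratings.median())

print(len(dropped), dropped.mean())   # smaller sample, one mean
print(len(imputed), imputed.mean())   # full sample, a different mean
```

Neither option is “correct” in the abstract: the defensible choice depends on why the values are missing and on how the downstream analysis will use them.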
An assignment from a previous year tasked students with comparing accident-related datasets from India, Ethiopia, and the UK, using multiple sources to uncover differences in road safety and accident severity across these regions. By working with varied datasets from India’s National Data Portal [4], the UK’s official road safety statistics [5], and Kaggle datasets [6], students learned to handle inconsistencies in data structure, format, and scope. They were required to assess the credibility and reliability of each source, making judgments on which dataset best represented the regional variations in road safety. This exercise not only enhanced their data-wrangling and integration skills but also provided insights into the cultural and infrastructural factors impacting road safety in each country. Students critically examined the socio-economic, political, and technological factors that could influence the quality of the data, questioning whether certain patterns were due to the data collection methods or reflected true societal differences. Through their analyses, students identified patterns, such as the higher severity of accidents in certain regions of India compared with that in the UK, which suggested differences in safety measures, public awareness, and traffic regulations. In doing so, students applied critical thinking to understand the broader implications of data, considering how data science could support improvements in road safety policies or public awareness campaigns in each country.
A group of students developed workflows in KNIME (see Figure 4) to (a) preprocess the datasets—cleaning, standardizing, and merging—and (b) analyze key variables like accident frequency, severity, and contributing factors. By engaging in comparative analysis, students moved beyond basic data manipulation to interpret findings in a broader socio-economic and cultural context. Students were asked to consider how regional differences in infrastructure, vehicle types, and traffic laws could influence accident rates, requiring them to evaluate how these external factors interact with the data they were analyzing. This helped them appreciate how data science can inform public policy and highlight cultural distinctions in attitudes toward road safety. Moreover, students reflected on how different statistical methods or models could lead to alternative conclusions, questioning whether their results would hold up under different assumptions or data subsets. This experience reinforced the value of including diverse, real-world datasets in the curriculum to build critical thinking and data interpretation skills that are essential in global data science applications. Through this process, students not only learned to manipulate data but also developed the ability to critically assess and draw meaningful conclusions from complex, multi-dimensional datasets.
5.3. Bridging the Classroom and Industry
Practical exposure to messy datasets effectively bridges the academic–industry gap, preparing students for the real-world demands of data science roles. According to CrowdFlower [7], professional data scientists devote a significant amount of their time—often exceeding 50%—to collecting, labeling, and cleaning data. Integrating this essential yet frequently overlooked aspect into the curriculum provides students with a realistic preview of professional expectations, equipping them with skills that transfer directly to the workplace.
For instance, assignments involving the Chicago Crimes dataset required students to tackle incomplete and inconsistent data, handle missing values, and classify incidents based on crime type, preparing them for similar complexities in fields like public safety analytics. Likewise, the Airbnb rental data analysis introduced challenges such as merging datasets of varying structures and completeness, assessing average rental prices by city, and evaluating the relationship between price and review frequency.
In another assignment, students analyzed real environmental sensor data from an office installation in the university, capturing metrics such as temperature, humidity, luminosity, and energy consumption across various devices. The dataset enabled students to explore the impact of workdays versus weekends, as well as seasonal and weather-based variations, such as holiday periods versus semester weeks, winter versus summer, and cloudy versus sunny days. To uncover meaningful usage patterns, students abstracted the data into transactional form and applied association rule extraction techniques, revealing frequent energy usage patterns that highlighted behavioral differences in energy consumption. Furthermore, students used Long Short-Term Memory (LSTM) models to analyze time-series data, predicting trends in energy usage and helping them identify patterns that could inform energy conservation strategies. As part of the analysis, students employed clustering to detect different operational states—specifically distinguishing between normal and standby modes of an A/C unit (see Figure 5). This hands-on assignment allowed students to apply advanced machine learning techniques to real-world data, enhancing their skills in time-series forecasting, anomaly detection, and energy analytics. Through these practical applications, students gained experience in transforming raw data into actionable insights, equipping them with essential competencies for data science careers.
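The “abstract into transactions, then extract rules” step can be sketched from scratch. The transactions below are hypothetical hourly snapshots (the set of devices drawing above-idle power in that hour), and the support/confidence computation shown is the textbook definition rather than the specific tooling students used:

```python
from itertools import combinations

# Hypothetical transactions: devices active during each hourly snapshot.
transactions = [
    {"ac", "lights", "desktop"},
    {"ac", "desktop"},
    {"lights", "desktop"},
    {"ac", "lights", "desktop"},
    {"desktop"},
    {"ac", "desktop"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Score every device pair; a rule like {ac} -> {desktop} has
# confidence = support(pair) / support(antecedent).
items = sorted(set().union(*transactions))
for a, b in combinations(items, 2):
    pair = {a, b}
    conf = support(pair, transactions) / support({a}, transactions)
    print(f"{{{a}}} -> {{{b}}}: support={support(pair, transactions):.2f}, "
          f"confidence={conf:.2f}")
```

High-confidence pairs of this kind are exactly the “frequent energy usage patterns” mentioned above; on real sensor data, a library implementation of Apriori or FP-Growth would replace this brute-force pair scan.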
All the aforementioned real-world tasks gave students the opportunity to develop crucial problem-solving and data-wrangling skills, mirroring the everyday challenges they will face in industry, where datasets are rarely clean and ready for analysis.
5.4. Promoting Industry-Ready Problem-Solving Skills
The use of messy, real-world datasets in data science education fosters an environment where students learn to tackle industry-like problems, equipping them with skills for effective and efficient data-driven decision making. Preliminary observations from the course suggest that engagement with such datasets may contribute to a deeper understanding of the critical role of data quality in the analytical pipeline. By addressing missing, inconsistent, or noisy data, students have opportunities to refine their problem-solving abilities and appreciate the importance of robust data preprocessing in professional settings.
6. Discussion and Implications
The course’s major findings reinforce the value of using real-world datasets and decision-making problems in teaching data science, demonstrating that hands-on exposure to incomplete and inconsistent data fosters a deeper understanding of critical data-wrangling skills and real-world problem solving. Through assignments with open datasets that correspond to real-world incidents, students actively engaged in cleaning, transforming, and analyzing data that mirrored professional challenges. This experiential approach not only improved students’ technical competencies, such as data preprocessing and feature engineering, but also strengthened their critical thinking and decision-making skills while preserving a high degree of creativity and autonomy. Working with imperfect data allowed students to apply theoretical knowledge in practical, ambiguous scenarios, moving beyond static classroom examples and developing an intuitive understanding of the complexities inherent in real-world datasets.
These findings suggest that data science curricula could benefit from a broader integration of experiential learning principles, particularly in exposing students to unstructured, messy data as part of their core training. Instructors aiming to replicate this approach should consider designing real-world, open-ended assignments that mirror industry tasks and require students to make meaningful decisions about data processing and analysis. These assignments should encourage students to act as decision makers—rather than passive analysts—and build domain-specific knowledge, which enables them to use data as a powerful tool for solving real-world problems. This approach is especially useful when students must decide how to treat data gaps, outliers, and inconsistencies, encouraging them to consider the impact of each step on the analysis’s overall quality and integrity.
Additionally, interdisciplinary collaboration enhances the benefits of experiential learning by merging diverse perspectives, such as mixing students from business and technical fields. This setup mimics real-world work environments, where data scientists often collaborate with professionals across domains. Through interdisciplinary teamwork, students gain a holistic understanding of how data science impacts various fields, learning to communicate insights to non-technical stakeholders, translate complex analysis into actionable outcomes, and refine their problem-solving approaches to meet practical needs. Curriculum changes like these better prepare students for the realities of data science roles, where clean, well-structured data are rare, and complex data wrangling is essential.
Observations from our course experience support the notion that real-world, messy datasets can be a valuable tool in preparing students for professional data science challenges. Unlike prior works that focus on structured datasets for conceptual clarity (e.g., Yan and Davis (2019)), our experience underscores the need for iterative data refinement, as highlighted by Hohman et al. (2020) and Jarrahi et al. (2023). Moreover, while studies such as Rosenthal and Chung (2020) have advocated for the curriculum-wide integration of data science principles, our approach emphasizes hands-on engagement with ambiguous and incomplete datasets to cultivate critical thinking. This aligns with experiential learning models like Kolb’s ELT, where iterative problem solving enhances students’ adaptability. By bridging these perspectives, our work contributes to a growing discourse on the pedagogical benefits of exposing students to data complexities that mirror real-world industry demands.
6.1. Limitations of the Study
The current study has several limitations that should be acknowledged to provide a balanced perspective on its findings. First, while this study draws from several years of course implementation, the empirical data supporting its claims are limited. The evaluation process relied primarily on student satisfaction surveys rather than a framework designed to assess learning outcomes or critical thinking development. Although the course consistently received high ratings and accolades, the lack of detailed qualitative and quantitative evidence from student performance makes it difficult to substantiate the observed benefits beyond ratings, quotes, or testimonials.
Additionally, the study currently reflects the main observations and findings regarding the course structure and outcomes, supported by selected examples. However, a more systematic evaluation of student performance, both on the theoretical concepts of the course and on the data mining and data analysis tasks, is still missing. Organizing a study to quantitatively evaluate how the course design and assignment structure influence student performance and engagement is one of our next research steps. Although we aim to emphasize the positive aspects of the course, we acknowledge that any educational approach may yield both positive and negative outcomes depending on the context and individual student experiences.
The demographics of the student population, though briefly addressed, could have been more thoroughly explored to assess the collaboration dynamics between business and IT students. Future work will include a more comprehensive analysis of how different student backgrounds may influence learning outcomes and group interactions. These limitations highlight the need for more rigorous evaluation methodologies in subsequent studies and for aligning the course framework with research-based best practices. Finally, while we do not claim that our findings can be generalized universally, we acknowledge that the context of our study is specific, and the results may not apply in all educational settings. We have explicitly discussed these limitations in this subsection to provide a reflective account of the course’s impact.
6.2. Future Research Steps
Looking forward, future research could investigate the long-term effects of working with messy data on students’ career readiness, including their resilience, adaptability, and ability to handle complex, real-world data scenarios. In forthcoming studies, we plan to implement formal evaluation frameworks—such as pre- and post-course assessments, control group comparisons, and standardized rubrics—to systematically measure changes in technical competence and soft skills over time. Examining this area further could provide valuable insights into how learning with messy data builds foundational skills in data science students, equipping them with the perseverance needed to complete projects despite challenging data constraints. Additionally, exploring messy data exposure across sectors such as healthcare, finance, or environmental science could provide insights into how different data types and complexities shape learning outcomes. For instance, healthcare data might expose students to issues related to data privacy, while financial data could introduce them to volatility and time-sensitive analysis requirements.
Moreover, we intend to integrate mixed-methods research approaches, combining quantitative assessments with qualitative feedback (e.g., student interviews, focus groups, and reflective journals) to capture a comprehensive picture of the educational impact. This research could refine best practices in experiential data science education, ensuring that graduates possess essential technical skills alongside critical soft skills, such as problem-solving agility, effective communication, and collaborative teamwork—qualities that are increasingly demanded in today’s dynamic data-driven workplace. Future studies will also examine the scalability and adaptability of our curriculum across diverse institutional contexts, thereby establishing benchmarks for best practices in data science education. Furthermore, educators could benefit from studying the balance between technical and soft skill development in data science curricula, ultimately creating programs that help students develop a well-rounded skill set that integrates analytical rigor with adaptability and communication skills needed for the workforce.
7. Conclusions
As argued throughout this paper, the use of messy, real-world datasets in data science education fosters an environment where students learn to tackle industry-like problems, equipping them with skills for effective and efficient data-driven decision making. Our findings show that students who engage with such datasets develop a deeper understanding of the critical role of data quality in the overall analytical pipeline. By actively addressing missing, inconsistent, or noisy data, they refine their problem-solving abilities and reinforce the necessity of robust data preprocessing—an essential skill in professional settings.
The course structure was designed to emphasize the relevance of data insights for hypothetical decision makers, encouraging students to approach each assignment with a focus on producing meaningful, actionable results. Observations from project presentations indicate that students were able to articulate business recommendations based on their analyses. This industry-aligned problem-solving mindset, coupled with the integration of business and technical skills in team-based projects, appeared to support the translation of data wrangling and analysis into insights that could impact organizational strategy. Additionally, the integration of business context within assignments seemed to help students identify real-world challenges in data interpretation, leading to more refined data models and strategic recommendations. In an educational setting, this holistic approach demonstrates that, beyond technical skills, successful data scientists must understand data’s relevance within broader business contexts, ultimately setting students on a path toward meaningful contributions in their future careers.
The hands-on experience with messy datasets also builds resilience and adaptability, traits that are vital in professional environments. By engaging with real-world data from the outset, students develop a strong capacity to analyze and preprocess raw data thoughtfully, recognizing that data preparation is often the most crucial stage for producing meaningful, actionable insights. This conclusion is based on qualitative observations of student performance and feedback during the course: students were consistently exposed to incomplete, inconsistent, or ambiguous datasets that mirror the challenges encountered in professional environments. Part of their project work involved identifying actionable insights from data, presenting knowledge potentially useful to decision makers, and arguing how this knowledge could be applied in practice (e.g., for improving road safety or investing in real-estate projects). Teams that managed the data preparation phases confidently also tended to achieve better project outcomes, including a higher rate of successful model deployment. Students gained confidence in making informed decisions, understanding that data cleaning and preparation are not merely preliminary steps but central to ensuring the reliability of any subsequent analysis. This realization underscores the value of structured preprocessing and exploratory data analysis as foundational components of successful data-driven decision making. Although these claims stem primarily from qualitative insights rather than quantitative data, they align with students’ successful completion of complex, hands-on assignments.
Consequently, students can better manage the full life cycle of data analysis projects, a quality that is essential for data scientists expected to handle complex, dynamic datasets independently.
By aligning the course’s outcomes with the methodological framework, we ensure that the development of students’ ability to work with messy data, analyze them effectively, and present them within a business context is directly linked to the structured approach of the study. Moreover, incorporating a business context into data science assignments promotes the development of critical soft skills, such as effective communication, strategic thinking, and collaboration. As students work with real-world data to derive insights that could be valuable to decision makers, they learn to convey their findings in ways that resonate with both technical and non-technical stakeholders. By placing their analytical work within a larger organizational context, students not only enhance their technical abilities but also refine their capacity to interpret and articulate data-driven insights in a way that informs broader business decisions. This skill—the ability to frame data analysis within strategic goals—is a differentiator for data science graduates entering the workforce, where the value of data insights lies in their practical application.
The curriculum’s emphasis on interdisciplinary collaboration also mirrors the collaborative nature of real-world data science teams, where diverse expertise is essential for robust problem solving. Project assessments indicated that interdisciplinary teams produced more innovative solutions and showed a marked improvement in their ability to integrate diverse perspectives into problem-solving processes. Team-based projects that bring together students from different academic backgrounds enhance learning outcomes, as each student must consider varying perspectives and learn to address multiple objectives through data analysis. These collaborative experiences are invaluable, as they enable students to approach data science problems in innovative ways, often blending technical and strategic considerations that reflect real industry scenarios. In doing so, students become more versatile and effective professionals, capable of integrating domain knowledge with data science principles.
In summary, this approach to data science education—which emphasizes real-world data, critical decision making, interdisciplinary teamwork, and a business-oriented mindset—appears to contribute to the development of graduates who are not only technically proficient but also adaptable, communicative, and strategic. Preliminary observations from this course suggest that such an approach improves students’ readiness for data science careers, supported by encouraging post-graduation employment outcomes and positive industry feedback on graduates’ ability to handle complex, real-world data challenges. While these initial insights are promising, further validation is needed to fully establish the impact of this educational approach on professional outcomes.