Article

Predicting Employee Absence from Historical Absence Profiles with Machine Learning

1 Faculty of Information Studies, 8000 Novo mesto, Slovenia
2 1A Internet, d.o.o., 8270 Krško, Slovenia
3 Jožef Stefan Institute, 1000 Ljubljana, Slovenia
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 7037; https://doi.org/10.3390/app14167037
Submission received: 10 May 2024 / Revised: 30 July 2024 / Accepted: 31 July 2024 / Published: 11 August 2024

Abstract:
In today’s dynamic business world, organizations are increasingly relying on innovative technologies to improve the efficiency and effectiveness of their human resource (HR) management. Our study uses historical time and attendance data collected with the MojeUre time and attendance system to predict employee absenteeism, including sick and vacation leave, using machine learning methods. We integrate employee demographic data and the absence profiles on timesheets showing daily attendance patterns as fundamental elements for our analysis. We also convert the absence data into a feature-based format suitable for the machine learning methods used. Our primary goal in this paper is to evaluate how well we can predict sick leave and vacation leave over short- and long-term intervals using tree-based machine learning methods based on the predictive clustering paradigm. This paper compares the effectiveness of these methods in different learning settings and discusses their impact on improving HR decision-making processes.

1. Introduction

In today’s rapidly evolving business landscape, employers in various sectors are increasingly turning to advanced technologies to improve their operational efficiency. The main objective is to increase the productivity of human resource management and optimize personnel expenditure. Organizations are particularly focused on gaining valuable insight from the vast amounts of data they are accumulating in the field of human resources management (HRM). This data-driven approach is important not only for day-to-day operational support and decision-making but also for compliance with national and international regulatory standards.
In today’s world, HR leaders have moved from making decisions reactively based on static reports to a more dynamic approach that integrates business and people data to predict future outcomes. This evolution is facilitated by the use of sophisticated dashboards that enable managers to identify patterns, anticipate organizational needs and detect anomalies [1]. Advanced forecasting tools, including what-if simulations, play a crucial role in this context. They enable HR departments to preemptively address potential changes in employee behavior and align HR strategies with desired business outcomes. Such strategic measures are essential for measuring employee performance, improving engagement, examining collaboration patterns, analyzing turnover, and modelling the lifetime value of employees [2].
At the heart of effective HR management is the ability to forecast and plan. HR departments have the important task of supporting other departments by ensuring that the right people are available to meet the needs of the organization. Accurate forecasting is critical to aligning staff availability with the needs of the business to optimize staff and resource allocation [3].
The use of predictive analytics to forecast employee absences offers significant strategic advantages. By enabling proactive organization and staffing, these tools help companies anticipate and resolve staff shortages before they disrupt operations. Predictive insights enable strategic staff redeployment, helping to balance workloads and maintain productivity across all departments. This proactive approach not only minimizes the impact of business disruption, but also improves organizational adaptability, allowing companies to respond quickly to changes in the work environment.
The further integration of predictive modeling into human resource information systems (HRIS) enables organizations to use these tools as part of a comprehensive decision support system (DSS) that provides HR leaders with actionable insights for proactive people management [4]. Seamless integration with existing HR information systems ensures real-time data flow and updates and enables predictions based on the latest data. For example, if the system detects an increased risk of absenteeism, it can automatically notify managers and suggest preventative actions [5]. In addition, a user-friendly dashboard that visualizes key metrics and predictions can significantly improve the user experience for HR managers by stimulating preventive discussions to resolve potential issues or coordinate appropriate staffing. Such tools support better planning, greater efficiency, and a more agile response to dynamic changes in the workforce, in line with best practices in workforce analytics and workforce planning [4,5].
Motivation. The increasing complexity of HR management has necessitated the development of advanced analytical tools to effectively predict employee absenteeism. Our previous study [6] served as the basis for exploring the demand for automated absence forecasting. That study showed that organizations have a strong interest in using historical data to accurately predict employee absence, particularly short-term absences such as sick leave and vacation leave up to a week in advance.
The research utilized a survey approach targeting users of the MojeUre system to identify the key features desired in an absence prediction system. The results showed that organizations place a high value on the ability to accurately predict absences within a one-week time frame with minimal errors, ideally within two days. Such features would allow companies to proactively arrange for replacement staff and optimize work allocation to mitigate the impact of unplanned absences.
In addition, the study emphasized the importance of developing a user-friendly analysis tool that can be seamlessly integrated into existing HR systems for easy adoption and practical use in daily business. Feedback from the initial research has guided the design of the study presented in this paper, ensuring that it takes into account the specific needs and preferences of end users.
Problem statement and contributions of this paper. In this paper, we focus on exploring the potential for predicting employee absence (vacation leave and sick leave) from the MojeUre database using machine learning. Furthermore, this paper evaluates the performance of machine learning methods in predicting employee absenteeism in different forecasting scenarios: one week, two weeks, and one month ahead. In this paper, we use symbolic machine learning methods in the form of predictive clustering trees [7] implemented in the CLUS software [8]. The main contributions of this study include:
  • Practical application to real data—We use real data from the MojeUre system (https://mojeure.si/, accessed on 18 April 2024), which was provided by the Slovenian company 1A Internet, d.o.o. This practical application emphasizes the potential of the models we have created to be integrated into actual business operations and significantly improve HR decision-making processes.
  • Feature-based data representation—We transform time series data from the MojeUre database into a feature-based format suitable for machine learning applications through careful feature engineering and the integration of expert knowledge.
  • Comparative method analysis—We perform a comprehensive comparison of different methods based on predictive clustering trees. This includes methods that output single models and ensemble methods that output multiple models to determine their effectiveness at different prediction intervals. This analysis also includes an overview of single and multi-target methods and highlights their respective advantages and limitations in predicting short-term and long-term absenteeism.
Organization. In Section 2 we define the problem of predicting employee absenteeism and the different types of absenteeism and present our methodology to approach this challenge through machine learning. Furthermore, in Section 3, we present the data used in this study, and in Section 4 we describe the experimental design and research questions. In Section 5 we present the results of our experiments in detail, focussing in particular on the prediction of vacation leave and sick leave. In Section 6 we discuss these results and address more general issues and implications of using the models created to support strategic decisions. The paper concludes in Section 7 with a summary of our contributions and possible directions for future research.

2. Background and Related Work

Here we outline how the problem of predicting employee absenteeism can be conceptualized and structured as a machine learning task, including a discussion of the transformation of this problem into different machine learning tasks. Next, we discuss different modelling approaches and introduce the predictive clustering trees as the chosen methodology for building models in this study. Finally, we present an overview of related work.

2.1. The Problem of Predicting Employee Absence

In this paper, we define the prediction of employee absence as the process of predicting when and how often employees might be absent from work in the future. Our main goal in this paper is to predict employee absenteeism based on historical timesheet data. We focus on predicting two types of absences: sick leave and vacation leave. For this purpose, we use the historical absence profiles of employees together with demographic data.
In our view, predicting vacation leave and sick leave represents two distinct analytical tasks, each with unique characteristics and implications. Vacation leave is usually planned and predictable, with trends possibly correlated with school holidays, summer months, or the holiday season, making it somewhat easier to predict using historical data. In contrast, sick leave is often unplanned and can be influenced by a variety of unpredictable factors such as personal health problems or epidemics within the community. Therefore, modelling techniques and the type of data used for prediction differ significantly. For vacation leave, the data could include the timing of past holidays, length of service, and company-wide vacation policies. In contrast, predicting sickness absence could focus on local health trends, historical sickness absence data, and individual health predispositions.
The development of separate forecasting models for vacation and sick leave is crucial, as each type of absence is of fundamental importance. Separate models allow for customization of features, such as the inclusion of public health data for sickness absence predictions and school calendar data for vacation predictions, improving the accuracy and relevance of predictions. In addition, the consequences of each type of leave can be different. For example, an unplanned sick leave may require an immediate temporary replacement, while a planned leave may give managers time to prepare. Precise models are therefore required to optimize personnel management and maintain productivity.

2.2. Formulating the Prediction of Employee Absenteeism as a Machine Learning Task

Predicting employee absenteeism can be translated into different machine learning tasks, each of which fulfills specific prediction requirements and provides unique insights. Depending on the desired result and the level of detail, this prediction task can be approached as binary classification, regression [9], multi-target regression, multi-label classification [10] or hierarchical regression [11]. The selected task also influences the data representation used in the learning process, which is described in more detail in Section 3. Here, we present the different problem formulations in more detail.
In the binary classification approach to predicting employee absence, the task is to determine whether an employee will be absent (1) or present (0) on a given day or week. This formulation is simple yet powerful for immediate, short-term planning requirements. This model can use characteristics such as the day of the week, public holidays, individual workload, and historical absence patterns to make predictions.
The (single-target) regression task focuses on predicting the number of days an employee will be absent within a certain time frame, e.g., a week. With this task, we not only predict attendance or absence, but also quantify the extent of absence. It is suitable for creating detailed forecasts that help with resource allocation and strategic planning in companies. The key features that can be used to create a regression model include aggregated absence data over a specific historical period.
Multi-target regression involves predicting the number of absence days for multiple future periods simultaneously, such as several consecutive weeks. Multi-target regression extends predictions to multiple future intervals, providing a comprehensive view that aids in long-term strategic planning. This approach is beneficial for understanding patterns over extended periods, which can be critical for addressing systemic issues. For this approach, it is useful to use lagged features, i.e., variables derived from past data points. For example, when predicting employee absences for the week ahead, lagged features could include the number of absence days in previous weeks. This approach was also used in our work.
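To illustrate the idea of lagged features, the following sketch constructs them from weekly absence counts using pandas. The records and values are invented for illustration and do not come from the MojeUre database.

```python
import pandas as pd

# Hypothetical weekly absence counts per employee (illustrative only).
df = pd.DataFrame({
    "employee_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "week":        [1, 2, 3, 4, 1, 2, 3, 4],
    "absent_days": [0, 2, 1, 0, 5, 0, 0, 1],
})

# Lagged features: absence counts from the previous one and two weeks,
# computed separately for each employee.
df = df.sort_values(["employee_id", "week"])
for lag in (1, 2):
    df[f"absent_days_lag{lag}"] = (
        df.groupby("employee_id")["absent_days"].shift(lag)
    )

# Rows with a full lag history can serve as training instances,
# with "absent_days" as the regression target.
train = df.dropna()
print(train)
```

The first weeks of each employee are dropped because their lag values are undefined; in practice, the length of the historical window determines how many such rows are lost.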
In a multi-label classification scenario, we want to predict multiple binary outcomes (present or absent) for each unit of time within a period, such as daily predictions over a week. This approach is suitable for weekly planning where daily absence predictions are needed. This approach requires both day-specific and general predictive features.
In a hierarchical regression task, we build models to predict absences at multiple levels of aggregation, from more general (e.g., quarterly) to more specific (e.g., monthly, weekly). This model helps to understand and plan for patterns that occur on different timescales.

2.3. Modelling Approach Selection

Choosing the right modeling approach for an application scenario can have a significant impact on the effectiveness and reliability of predictions. Building single models or ensemble models are two basic approaches that can be used for each of the tasks described above, with each approach having its strengths and challenges.
Single models are generally easier to generate, interpret and implement. They can usually be trained faster than ensemble models and are therefore suitable for environments where speed is critical. These models require fewer computing resources, which can be a deciding factor when working with limited hardware or in cloud environments where computing time is costly. Single models are often more prone to overfitting, especially if they are not properly regularised or if the model is too complex in relation to the amount of training data.
Ensemble models use multiple base models to achieve better predictive performance than could be achieved by one of the models alone. Examples of ensemble methods that can be used to generate ensemble models are Bagging [12], Random Forests [13], Random Subspaces [14] method and others. By combining multiple models, ensembles often achieve higher accuracy and better generalisation to unseen data compared to single models. They are particularly effective in reducing variance and bias. Ensemble methods are more robust against overfitting, especially in cases where the individual models are simple (e.g., decision trees in a random forest).
In the context of this study, robustness refers to the ability of predictive models to provide consistent and reliable results across diverse and potentially noisy data sets and to maintain their performance in the face of various data irregularities and modeling challenges [15,16].
Ensemble models are generally more complex to generate and tune due to the interactions between the individual models. The methods used to create them require more computing resources, which can increase training times and require more powerful hardware or cloud computing power. These models are usually more difficult to interpret than individual models.
When predicting employee absenteeism, the distinction between single-target and multi-target prediction models can significantly influence the outcome and applicability of the predictions. These two approaches fulfill different prediction requirements and offer different advantages and challenges.
Single-target prediction models predict a single dependent variable as output for each input instance. In the context of employee absence, this could mean predicting whether an employee will be absent on a given day (binary classification) or predicting the number of days an employee will be absent in a given week (regression). Single-target models can be fine-tuned to predict a particular outcome very well, which can be an advantage when high accuracy is required for that prediction. However, they do not take into account correlations between multiple targets, which can lead to inefficiencies and missed insights where outputs are interdependent. If predictions are needed for multiple targets (e.g., for the days of a week), a separate model is required for each target, which can complicate the use and maintenance of the prediction system.
Multi-target prediction models simultaneously predict multiple dependent variables from the same set of inputs. When predicting employee absenteeism, this could mean predicting the number of days of absence over multiple weeks or predicting daily absence for each day of a single week. These models process multiple targets simultaneously, which can be more efficient in terms of calculation and data utilisation than creating a separate model for each target. They can take advantage of the inherent correlations between multiple targets, potentially improving the accuracy of all targets. By learning multiple targets at once, these models can generalise better, especially when the targets influence each other. Few machine learning algorithms inherently support multi-target regression or classification of multiple targets, which can limit the choice of tools and approaches.
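The contrast between the two setups can be sketched with scikit-learn, whose random forest regressor happens to support multi-output targets natively. Note this is only an illustration of the single-target vs. multi-target framing on synthetic data, not the CLUS-based PCT method used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the absence data: 200 employees, 5 descriptive
# features, and absence-day counts for 3 future weeks as targets.
X = rng.normal(size=(200, 5))
Y = np.clip(rng.poisson(lam=1.5, size=(200, 3)) + (X[:, :1] > 0), 0, 7)

# Single-target: one model fitted per future week.
single = [
    RandomForestRegressor(n_estimators=50, random_state=0).fit(X, Y[:, t])
    for t in range(Y.shape[1])
]

# Multi-target: one model predicts all three weeks jointly.
multi = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, Y)

x_new = X[:1]
print([m.predict(x_new)[0] for m in single])  # three separate predictions
print(multi.predict(x_new)[0])                # one joint prediction, length 3
```

The single-target setup requires maintaining three models, while the multi-target model produces all three weekly estimates in one call and can exploit correlations between the weeks during training.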

2.4. Predicting Employee Absence with Predictive Clustering Trees

In the methodological framework of our study, which aims to predict employee absenteeism, we used Predictive Clustering Trees (PCTs) [7], which are available in the CLUS Toolbox [8] (https://github.com/ElsevierSoftwareX/SOFTX-D-23-00369, accessed on 10 May 2024). PCTs offer a versatile approach to machine learning that goes beyond traditional decision trees by combining both clustering and predictive capabilities. These models are ideal for scenarios where both clustering of similar instances and prediction of outcomes are required. Essentially, PCTs build a tree where each node serves not only as a decision point for splitting data but also as a cluster that groups similar data points. Using this method, PCTs can process different data sets by recognizing and exploiting the inherent groupings within the data.
In our study, PCTs were used in two different configurations: as single-target models (denoted ST) in a regression setting to predict the number of days of absence, and in a multi-target regression setting (denoted MT), where the aim was to predict absences for several future periods simultaneously.
As a baseline for comparison in our study, we have used a PCT of minimal complexity consisting of only one node. This simple model serves as a basic reference point that allows us to evaluate the effectiveness and added value of more complex models in predicting employee absenteeism.
To improve the robustness and accuracy of our predictions, we have extended our use of PCTs to various ensemble methods [17]. Specifically, we used PCTs for Bagging, Random Subspaces (denoted as RSubspaces), Random Forest (denoted as RForest), and a combined method that integrates both bagging and random subspaces (denoted as BagSubspaces).
Bagging, or bootstrap aggregating [12], significantly improves the stability and accuracy of machine learning algorithms. In this technique, multiple versions of a model are trained on different subsets of the original dataset, which are sampled with replacement.
In the random subspaces method [18], also known as attribute bagging or feature bagging, each model in the ensemble is trained using a random subset of features rather than data points. This strategy reduces the correlation between the models within the ensemble by giving them different perspectives on the data based on different attributes.
Random Forests, as described by Breiman [13], improve the bagging algorithm by adding an additional layer of randomness during tree construction. This method not only creates multiple bootstrap samples from the data but also selects a random subset of features at each decision point when the trees are constructed. Random forests can also evaluate the influence of each attribute on the predictions. In this process, numerous trees are trained on random subsets of data, and the impact of each attribute on the model’s decision process is measured. This ranking identifies the most influential attributes in the dataset, providing valuable insights for attribute selection and model refinement.
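The attribute ranking produced by a random forest can be illustrated with scikit-learn's impurity-based importance scores; this is a generic sketch on synthetic data, not the CLUS implementation used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic data in which only the first feature drives the target.
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# One importance score per attribute; the scores sum to 1, and the
# informative feature should dominate the ranking.
for name, score in zip(["f0", "f1", "f2", "f3"], forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```

Such a ranking identifies the most influential attributes and can guide attribute selection and model refinement, as described above.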
The BagSubspaces method [19] combines bagging and random subspaces, harnessing both the randomness of the data and the randomness of the features. In this approach, multiple bootstrap samples are created from the original dataset, and each model is trained on a random subset of features for each sample. By combining these methods, we aim to significantly increase the diversity of the models in the ensemble, which can lead to better performance in terms of accuracy and robustness.
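The combined scheme can be sketched in a few lines: each base model sees a bootstrap sample of the rows and a random subset of the columns. The helper names and decision-tree base learner below are our own illustrative choices, not the CLUS PCT implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bag_subspaces(X, y, n_models=10, subspace_frac=0.5, seed=0):
    """Train an ensemble combining bagging with random subspaces."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(subspace_frac * d))
    ensemble = []
    for _ in range(n_models):
        rows = rng.integers(0, n, size=n)            # bootstrap sample
        cols = rng.choice(d, size=k, replace=False)  # random feature subset
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        ensemble.append((tree, cols))
    return ensemble

def predict_bag_subspaces(ensemble, X):
    """Average the base-model predictions."""
    preds = [tree.predict(X[:, cols]) for tree, cols in ensemble]
    return np.mean(preds, axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)
ens = fit_bag_subspaces(X, y)
print(predict_bag_subspaces(ens, X[:3]))
```

Each base model stores the feature subset it was trained on so that prediction can project new instances onto the same columns before averaging.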

2.5. Related Work

In the growing field of human resources analytics, the use of machine learning to predict absenteeism of employees is becoming increasingly important for effective workforce management. This section looks at a number of studies that enhance our understanding of how predictive analytics can be used in HR. We explore a wide range of topics, from the use of neural networks to the impact of demographic factors and the identification of risk factors for absenteeism. The literature is divided into topics such as the use of advanced computational models, analysis of long-term data, demographic influences, risk factors, ethical considerations, impact on employee engagement and practical applications of HR analytics. Together, these topics deepen our understanding of how historical data can be used to predict employee absenteeism. This closely aligns with the focus of our work on predicting future absenteeism using machine-learning techniques based on historical patterns.
In exploring advanced modeling techniques within HR, Dogruyol et al. [20] conducted a comprehensive evaluation of three neural network models—Backpropagation, Radial Basis Function, and Long Short-Term Memory (LSTM)—to predict absenteeism of employees. Their study highlights the superior performance of the LSTM model, which achieves near-perfect prediction accuracy in complex data scenarios. This emphasises the potential of advanced neural network architectures in processing detailed absence patterns and demonstrates their adaptability to complicated HR data. Similarly, Jayme et al. [21] address the challenge of predicting employee absence risk using various machine learning methods, including neural networks, random forests, and support vector machines (SVMs). They investigated the impact of training data size and different feature sets on model performance and introduced a method to rank the sensitivity of neural networks to individual features, providing valuable insight to improve predictive models. In general, the integration of deep learning methods, as proposed in recent studies [22,23], could improve prediction performance, especially when capturing complex patterns and temporal dependencies within the data.
Ali Shah et al. [24] discuss a novel deep neural network (DNN) model to predict employee absenteeism using data from a Brazilian courier company. The DNN achieved a 97.5% accuracy, outperforming traditional models such as Decision Trees and SVM, highlighting its potential to help organizations manage absenteeism effectively. Lima et al. [25] investigate deep learning models to predict absenteeism among public security agents, finding that Multilayer Perceptrons (MLP) achieved the highest accuracy at 78%, demonstrating the applicability of these models in public security contexts.
Thomson [26] uses multiple regression to examine demographic influences on employee absenteeism. More specifically, the author examines the effects of age and seniority on employee absenteeism in different work groups within the UK local government. The study identifies linear and curvilinear relationships, with seniority moderating the impact of age on absence, contributing to a more nuanced understanding of the impact of demographic factors on workplace attendance.
Boot et al. [27] focus on identifying the most important predictors of long-term and frequent sick leave absence among airline employees. Their analysis, using logistic regression, identifies factors such as age, pregnancy, working conditions, and previous absenteeism as significant predictors and provides practical insights into the risk factors that can predict sickness absence with considerable accuracy. In addition, Montano [28] presents a systematic analysis of machine learning models used to predict absenteeism and temporary disability. Reviewing research from 2010 onwards, the study highlights artificial neural networks (ANNs) as the most effective and widely used models, particularly in Brazil and India, but also in countries such as Saudi Arabia and Australia.
Notenbomer et al. [29] aim to develop prediction models for long-term sick absence (SA) among employees with frequent SA. Their models included factors like job demands, job resources, burnout, and work engagement, showing significant but limited predictive ability, indicating the need for further development.
Edwards et al. [30] address broader challenges and ethical considerations in the field of HR analytics. Their discussion in a special issue on HR analytics highlights the complexity of conducting empirical HR research and argues for greater recognition and systematic exploration of analytical techniques in human resource management.
Al Zeer et al. [31] investigate the effects of employee engagement and empowerment on performance in higher education. Their results, derived from structural equation modeling, emphasize the significant impact of engagement and empowerment on performance and highlight the importance of these factors in improving organizational outcomes.
Salazar et al. [32] deal with the persistent problem of absenteeism, especially in Brazil. They investigated different machine learning models to predict whether patients will attend their scheduled appointments using an end-to-end machine learning process. The study highlights the decision tree algorithm as an effective model for predicting patient attendance and shows the practical applications of machine learning in different contexts of absence prediction.
Lawrance et al. [33] describe a decision support system designed for a Belgian HR and Well-Being service provider to predict employee absenteeism using machine learning models and real HR and payroll data, focusing on cost-sensitive learning to address unequal misclassification costs.
Our own preliminary studies, described in detail in [34,35], focus on the practical application of predictive modelling of employee absenteeism using historical timesheet data. These papers discuss the influence of historical data window size on predictive accuracy and address the challenges that arise in effectively representing temporal data, providing empirical insights that help refine predictive modeling in HR.
In reviewing the literature, we have identified several problems and gaps that could be addressed. First, if we focus on the intelligent analysis approaches used in the state of the art, we can see that the authors mostly use basic statistics and classical machine learning approaches for classification and regression, such as decision trees or neural networks. More demanding tasks involving structured outputs, such as multi-target prediction and multi-label classification, which play an important role in the field of machine learning, have not yet been addressed.
Furthermore, the explanatory power of automatically generated models for HRM experts has not been a focus of research to date. Moreover, the literature mainly focuses on predictive analyses, with relatively few descriptive analyses based on historical data. Finally, among currently available time-tracking software, only some systems have developed predictive analytics services, and these do not provide adequate decision support for management. For example, they offer automatic scheduling based on labour laws, overtime, and coordination with other employees, but they do not predict how employees will work in the next week, month, or half year based on historical data. In addition, we found no comparable prediction scenarios for employee absenteeism.

3. Employee Absenteeism Data

3.1. Data Sources

The MojeUre system (https://mojeure.si/, accessed on 20 June 2024) was developed by 1A Internet d.o.o., a Slovenian company, to facilitate employee scheduling, working time recording and absence management. This comprehensive system enables the uncomplicated recording of working hours and offers functions such as holiday management, sick leave recording, travel orders and much more.
Working hours can be recorded on the Internet or through a mobile app. In addition, companies can purchase a time recording machine that allows employees to clock in or out using personalized cards. This system supports different types of breaks, such as lunch breaks and private breaks. Employees can also record their hours at NFC (Near Field Communication) or BLE (Bluetooth Low Energy) enabled clocking points or by scanning a QR code with their mobile phone camera. Manual entry of working hours is also possible, allowing different types of work hours to be recorded within a single day with just a few clicks.
The data analyzed in this study comes from the MojeUre electronic system, which is used by more than 200 different companies in Slovenia. This system primarily records employee check-ins and check-outs and categorizes different types of absences, including sick, vacation, paternity, maternity, part-time, study, and student leave.
For this paper, we analysed the MojeUre data from 2017 to the end of 2020. This period was chosen because of the availability of data in the system’s database and was deemed sufficient for our initial proof-of-concept analysis. Future research will focus on more recent data, especially after 2020, to assess the impact of the global COVID-19 pandemic.
In compliance with the General Data Protection Regulation (GDPR—https://eur-lex.europa.eu/eli/reg/2016/679/oj, accessed on 12 April 2024), the data provider signed the necessary agreements with all participating companies. This ensures that individual data is not only accessible for review by the companies themselves but is also legally compliant for use in analytical purposes.
In Figure 1 we show the distribution of the different company types from our dataset. As we can see, most employees are employed in the company type “Education, translation, culture and sport”, the second most common company type is “Agriculture, fishing and forestry” and the third most common company type is “Production”. The fewest employees are employed in the “Insurance” company type.
In Figure 2a we show the distribution of employee data between different statistical regions of Slovenia as recorded in our database. The “Central” region, in which the Slovenian capital Ljubljana is located, has the most employees. This is followed by the “South-East Slovenia” region. The “Littoral-Inner Carniola” region has the fewest employees, which indicates that fewer companies are represented in the database. This graph illustrates the regional differences in employment within the data.
In Figure 2b we visualise the statistical regions of Slovenia. The country is divided into 12 different regions, ranging from the smallest, R12—Central Sava, to the largest, R7—Central Slovenia. This map provides a visual representation of the geographical distribution and relative size of the regions.

3.2. Data Attributes and the Structure of the Data Instances

In Table 1, we present the general structure of the data instances used in the learning tasks. The datasets for predicting employee absences are organized into descriptive attributes and target attributes.
Descriptive attributes include all variables that provide nominal information about individual employees. These are categorised into two main types: Demographic profile and Absence profile.
The demographic profile, detailed in Table 2, contains several attributes that describe the demographic characteristics of the employees: the company type, with 33 predefined categories specifying the industry and company structure; the job type, which assigns an employee’s role to one of six categories based on responsibilities and department; the statistical region of Slovenia in which the company is registered, selected from 12 regions and providing geographical context; the working hours specified in the employee’s employment contract, which help differentiate between part-time, full-time and other arrangements; the type of employment (part-time, full-time, or student employment), all of which can potentially affect absenteeism behavior; and the number of years the employee has been with the company, which can affect their stability and likelihood of absenteeism. These attributes are important for analysing factors that may influence employee attendance and are crucial for any predictive modelling related to workplace absenteeism.
The absence profile is composed of lagged attributes. Lagged attributes in predictive modelling are historical data points that are used as input for the prediction of future events. In the context of predicting employee absences, lagged attributes are particularly useful for capturing patterns and trends over time that indicate an employee’s future absence behaviour. Our dataset contains two types of lagged attributes: statistical and seasonal.
Statistical lagged attributes are derived by aggregating past data over specific time periods, labelled XY in Table 3 (from 1 week to 12 months), to capture both short-term and long-term trends that help predict future outcomes. These attributes, detailed in Table 3, provide a comprehensive overview of an employee’s absence history in three different forms.
Firstly, we take into account the total number of days an employee was on vacation or sick leave during a given period to obtain an aggregate count that shows the general patterns of absence. In addition, we separate these counts into short-term (three days or less) and long-term (more than three days) categories for both vacation and sick leaves. This distinction not only provides insight into the duration of past absences, but also helps to predict similar trends in the future.
Finally, the dataset also contains daily absence figures, labelled Z, which indicate the exact days of the week on which an absence occurred, e.g., Mondays or Fridays. This level of detail helps to identify weekly patterns that are crucial to understanding regular absences, such as the tendency to take long weekends.
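As an illustration of how such attributes can be derived, the following sketch aggregates one employee's absence history into statistical lagged attributes and weekday counts. It is a simplified stand-in for the actual feature-extraction pipeline: the window labels, attribute names, and the `(start, end, kind)` record format are all hypothetical, and the weekday counts are taken over the whole history for brevity.

```python
from datetime import date, timedelta

def lagged_absence_features(absences, ref_date):
    """Aggregate an employee's absence history into lagged attributes.

    `absences` is a hypothetical list of (start_date, end_date, kind) tuples,
    with kind being "sick" or "vacation"; attribute names are illustrative.
    """
    features = {}
    windows = {"1w": 7, "1m": 30, "6m": 182, "12m": 365}  # XY periods
    for label, days in windows.items():
        window_start = ref_date - timedelta(days=days)
        for kind in ("sick", "vacation"):
            total = short = long = 0
            for start, end, k in absences:
                if k != kind or end < window_start or start >= ref_date:
                    continue
                # count only the in-window, pre-reference days of the episode
                duration = (min(end, ref_date - timedelta(days=1)) -
                            max(start, window_start)).days + 1
                total += duration
                if (end - start).days + 1 <= 3:   # short episode: <= 3 days
                    short += duration
                else:                             # long episode: > 3 days
                    long += duration
            features[f"{kind}_total_{label}"] = total
            features[f"{kind}_short_{label}"] = short
            features[f"{kind}_long_{label}"] = long
    # daily (Z) attributes: how often each weekday was an absence day
    weekday_counts = [0] * 7
    for start, end, _ in absences:
        d = start
        while d <= end:
            if d < ref_date:
                weekday_counts[d.weekday()] += 1
            d += timedelta(days=1)
    for i, name in enumerate(["mon", "tue", "wed", "thu", "fri", "sat", "sun"]):
        features[f"absent_{name}"] = weekday_counts[i]
    return features
```

A two-day sick episode within the last week would, for example, contribute 2 to both the total and short-term sick counts of the one-week window, and raise the counters for the weekdays on which it fell.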
The seasonal lagged attributes shown in Table 4 provide aggregates based on the occurrence of public holidays and the effects of seasonal variations. These attributes are calculated for future data projections and are important to improve the predictive accuracy of the models during periods when employee absence patterns are affected by seasonal factors. These attributes help the models account for the increased likelihood of absences during public holidays and more specific holiday periods such as winter or spring holidays, which can vary greatly by region and culture. The attributes also capture absences during different seasons and reflect the seasonal impact on employee health and behavior. For example, absenteeism could increase in winter due to the flu season.
On the other hand, the target attributes are used as prediction outputs. In the single-target prediction (see Table 1a), only one attribute is used as an output variable (e.g., the number of days of absence in the next week). With multi-target prediction (see Table 1b), two or more attributes are predicted simultaneously (e.g., the number of days of absence for each of the next four weeks). Table 5 lists the attributes used as targets for vacation leave prediction and sick leave prediction.
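The difference between the two settings can be sketched with multi-output regression trees. This is an illustrative stand-in only: the study uses PCTs from the CLUS toolbox, whereas the sketch below uses scikit-learn trees (which accept a 2D target in a comparable way) on synthetic data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 5))                        # descriptive attributes
Y = np.column_stack([X[:, 0], X[:, 1]])         # e.g. absence days in W1, W2

# single-target: one model per target week
single = [DecisionTreeRegressor(random_state=0).fit(X, Y[:, j])
          for j in range(Y.shape[1])]
# multi-target: one joint model predicting all weeks simultaneously
multi = DecisionTreeRegressor(random_state=0).fit(X, Y)

st_pred = np.column_stack([m.predict(X[:3]) for m in single])
mt_pred = multi.predict(X[:3])                  # shape (3, 2): both weeks
```

The multi-target model produces all weekly predictions in a single pass, which is what allows it to exploit correlations between the targets.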

3.3. Methodology for Dataset Construction

To develop robust predictive models for vacation and sick leave, we created two different dataset groups with a consistent methodology, differing only in their target attributes. Each dataset is designed for predicting the number of days of leave within a given week (in the case of single-target prediction) or over several weeks (in the case of multi-target prediction). In the first dataset group, the targets represent vacation days; in the second, sick days. Both dataset groups contain comprehensive descriptive attributes that include both the demographic profiles of employees and their absence profiles, as described in Section 3.2. In total, we used 145 descriptive attributes in the datasets.
The datasets were created using the sliding-window technique, which is ideal for capturing temporal patterns and trends in time series data. For each employee, we started with the earliest available data and created an observation window that captures all relevant data up to a specific week. The number of days of leave in the current week represents the target attribute for this observation. All lagged attributes that provide historical context are derived from the previous weeks’ data. In the case of a multi-week forecast, the lagged attributes are calculated for the weeks preceding the first week in the target attributes. In Figure 3, we present the sliding window technique for the case of a one-week forecast (Figure 3a), a two-week forecast (Figure 3b) and a four-week forecast (Figure 3c).
To systematically generate training examples, the window is moved forward by one week, and the process is repeated until the end of the available data for each employee. This method ensures that each training instance is a snapshot of the employee’s record, containing all relevant information up to the week immediately preceding the prediction week. This sliding window approach is applied uniformly to all companies and employees to ensure that the datasets are comprehensive and consistently structured.
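A minimal sketch of this windowing procedure, assuming a per-employee list of weekly absence-day counts (the real datasets use the engineered lagged attributes rather than raw weekly counts, and the window length here is an arbitrary choice):

```python
def sliding_window_instances(weekly_absence, history_weeks=4, horizon=1):
    """Generate (features, targets) pairs from a per-employee weekly series.

    `weekly_absence` is a hypothetical list of absence-day counts, one per
    week in chronological order. The window advances one week at a time.
    """
    instances = []
    for t in range(history_weeks, len(weekly_absence) - horizon + 1):
        features = weekly_absence[t - history_weeks:t]   # observation window
        targets = weekly_absence[t:t + horizon]          # 1..4 target weeks
        instances.append((features, targets))
    return instances
```

With `horizon=1` each instance predicts a single week (single-target); with `horizon=4` the targets cover the next four weeks (multi-target), and correspondingly fewer instances fit into the series.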

4. Experimental Design

4.1. Research Objectives and Questions

This study uses demographic profiles and historical attendance profiles to predict future absenteeism using various predictive modelling techniques, including single-target versus multi-target and single model versus ensemble of models. This section outlines the research questions that guide our investigation of the effectiveness of different predictive models over different time periods.
The research questions formulated for this study (see Table 6) aim to explore the capabilities of predictive models in accurately predicting employee absences over short, medium, and long terms. These questions are designed to test the relative performance of different modeling approaches and their practical implications for HR analytics. By addressing these questions, we aim to uncover the most efficient and accurate predictive strategies, thus assisting HR professionals in strategic decision-making processes.

4.2. Datasets

We created two comprehensive datasets from employee data that span from May 2017 to December 2021, covering a total of 3061 employees. Each dataset is tailored to a specific predictive task: one focuses on forecasting vacation leave, while the other targets predicting sick leave. The datasets were constructed using the sliding window methodology, which is detailed in Section 3.3. The attributes included in the data sets, as described in Section 3.2, were carefully engineered to improve the predictive accuracy of our models. By applying the sliding window technique over the entire period, we accumulated 312,453 data examples in each dataset, with each employee being represented multiple times. This robust approach captures temporal patterns and nuances in employee absences.

4.3. Evaluation Metrics

To extensively evaluate the performance of our predictive models, we employed four key metrics for regression tasks: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the Pearson correlation coefficient (CC) [9]. Table 7 presents all the metrics, together with an interpretation of what each metric shows in our context.
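For reference, the four metrics can be computed for a single target as follows (a plain-Python sketch, equivalent to the standard definitions):

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and Pearson correlation coefficient for one target."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # Pearson CC: covariance normalized by the two standard deviations
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    norm = math.sqrt(sum((t - mt) ** 2 for t in y_true)) * \
           math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "CC": cov / norm}
```

Note that a constant offset in the predictions inflates MAE/MSE/RMSE but leaves CC unchanged, which is why the error metrics and the correlation coefficient are reported side by side.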

4.4. Methods and Experimental Setup

In this study, we used predictive clustering trees (described in Section 2.4) implemented in the CLUS software toolbox. We used the CLUS software for modelling, followed by Python scripts to aggregate the results of the experiments. We compare the performance of six methods: Baseline (a PCT consisting of a single node), regression tree, bagging of PCTs, random forest of PCTs, random subspaces of PCTs (labelled Rsubspaces), and a combined method integrating bagging and random subspaces (labelled BagSubspace). For the ensemble methods, we set the number of iterations (base trees) to 100, as suggested in the literature [13]. The maximum number of features considered when selecting the best split in the random forest algorithm was set to the square root of the number of features, as recommended by Breiman [13]. All other parameters were left at their default settings to ensure consistency and comparability between the different modeling techniques.
In our study, we utilized 10-fold cross-validation as a validation technique. The reported results are the average values of the evaluation measures on 10 folds.
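To make the six compared configurations concrete, the sketch below sets up rough scikit-learn analogues on synthetic single-target data. This is illustrative only: the actual experiments use PCTs in CLUS, the data here are random stand-ins, and the 0.5 feature fraction for the subspace methods is an assumption of the sketch, not a parameter taken from the study.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 20))                       # synthetic descriptive attributes
y = X[:, 0] + 0.1 * rng.standard_normal(200)    # synthetic weekly target

methods = {
    "Baseline": DummyRegressor(),               # predicts the mean ("one node")
    "RegTree": DecisionTreeRegressor(random_state=0),
    "Bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                                random_state=0),
    "RForest": RandomForestRegressor(n_estimators=100, max_features="sqrt",
                                     random_state=0),
    # random subspaces: sample features for each tree, keep all instances
    "Rsubspaces": BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                                   bootstrap=False, max_features=0.5,
                                   random_state=0),
    # BagSubspace: bootstrap the instances AND sample the features
    "BagSubspace": BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                                    bootstrap=True, max_features=0.5,
                                    random_state=0),
}
preds = {name: m.fit(X, y).predict(X[:5]) for name, m in methods.items()}
```

Each configuration is fitted and queried through the same interface, which mirrors how the methods are compared under identical conditions in the experiments.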
The machine on which the experiments were performed is powered by an Intel(R) Core(TM) i7-7700 CPU running at 3.60 GHz, featuring 4 physical cores and 8 threads. It is equipped with 32 GB of RAM and runs Linux with kernel version 5.14.0.

5. Results

In this section, we present and analyse the results of our computational experiments, which are structured to systematically address each of our research questions as previously described. The results are divided into two main parts to reflect the different nature of the predictions. The first part focuses on predicting vacation leave, and the second on sick leave. For each category, we evaluate the performance of our forecasting models over different time periods—from one week to one month—and examine the effectiveness of different modelling techniques, including single-target, multi-target and various ensemble methods.

5.1. Vacation Leave Prediction

5.1.1. Short-Term Predictive Performance (Single-Target Prediction)

To assess the effectiveness of predicting vacation leave one week in advance based on our data representation, we compared the performance of the baseline method and regression trees with ensemble methods across four evaluation metrics. For this purpose, we used single-target prediction methods. The results are presented in Figure 4 (detailed results are available in Table S1 in the Supplementary Materials). Based on these metrics, the Bagging method delivers the best results for predicting one-week-ahead vacation leave, outperforming the other models. In contrast, the baseline method produced the poorest results in our evaluation, demonstrating the importance of more advanced techniques for accurate forecasting.

5.1.2. Medium-Term Predictive Performance (Multi-Target Prediction)

To assess the predictive accuracy of vacation leave absences over a two-week period, we employed multi-target versions of various tree-based algorithms, building a single model that predicts absences for both weeks simultaneously. The results of these predictive models are summarized in Figure 5 (detailed results are available in Table S2 in the Supplementary Materials).
The results reveal that Bagging consistently emerges as the most effective method across all evaluation metrics. In comparison, the baseline model shows the poorest performance, producing the highest errors in MAE, MSE, and RMSE, and even showing a negative correlation coefficient, indicating a lack of predictive power.
In Figure 5, it is also evident that the models generally achieve a better accuracy in predicting the first week, as indicated by the lower error rates compared to the predictions for the second week. This suggests that forecasting accuracy diminishes as the time horizon extends, emphasizing the challenge of predicting absences further into the future.
Overall, ensemble methods such as Bagging and Random Forests demonstrate superior performance compared to single-model approaches, underscoring the value of combining predictions from multiple models to enhance predictive accuracy for this task.

5.1.3. Long-Term Predictive Performance (Multi-Target Prediction)

To evaluate the effectiveness of predicting vacation leave one month ahead using our data representation, we compared the performance of various predictive models. The results, presented in Figure 6 (detailed results are available in Table S3 in the Supplementary Materials), reveal significant differences in predictive performance.
Among predictive models, bagging consistently delivers the best results for forecasting vacation leave one month ahead, achieving the lowest error rates across all target weeks and the highest correlation coefficient. In contrast, the baseline model exhibits the poorest performance, as indicated by its highest MAE, MSE, and RMSE values, alongside a negative CC, which underscores its lack of predictive power.
The analysis also demonstrates that bagging generally outperforms all other methods in each of the four target weeks. Additionally, the data in the table reveal that predictions for the first week are generally more accurate than predictions for the subsequent weeks, as indicated by the lowest RMSE error for the first week. Predictive errors increase and the CC value decreases as the forecast period extends further into the future, highlighting the increasing challenge of making accurate predictions as the prediction window lengthens.

5.1.4. Comparison of Prediction Approaches

To forecast vacation leave one month ahead, we have the option of using single-target models, which predict each week separately, or multi-target models, which predict all weeks together. To determine the better strategy, we compared the performance of both approaches.
Figure 7 presents the results of the RMSE measure for all targets using both single-target and multi-target models, while in Figure 8 we present the results of the CC measure (detailed results are available in Tables S4 and S5 in the Supplementary Materials). The results clearly show that multi-target models produce smaller errors across all targets. Among the methods evaluated, bagging stands out as the overall best prediction method, achieving the lowest RMSE values. However, the baseline model consistently produced the highest errors, reinforcing its role as a control.
Across the board, multi-target prediction outperformed single-target prediction for all methods, with the most significant gains observed in ensemble methods like Bagging and Random Forest. This finding underscores the advantage of leveraging correlations between targets to improve accuracy. By capturing the inherent relationships between different time frames, multi-target models provide a more nuanced and accurate forecast for vacation leave, particularly when applied through robust ensemble techniques.

5.1.5. Single-Model vs. Ensemble Vacation Leave Prediction

The results consistently indicate (see Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8) that all the ensemble methods outperform the single model (a regression tree in this case) in all three forecasting periods: one-week ahead, two-weeks ahead, and one-month ahead. This demonstrates that ensemble techniques offer superior predictive performance compared to a single-model approach, providing more accurate and reliable forecasts across various time frames.

5.1.6. Attribute Importance with Random Forests for Vacation Leave Prediction

In Table 8, we outline the most important attributes for predicting vacation leave using Random Forests across three different prediction horizons: one week, two weeks, and one month. The attributes are ordered according to their importance based on impurity reduction. The results suggest a clear pattern: for shorter prediction horizons (one week), longer-term historical data (up to 12 months) are more significant, reflecting recurring patterns in individual absence behaviors. However, as the prediction horizon extends to two weeks and beyond, more recent data and seasonal influences play a more prominent role. This indicates that, while longer-term historical data remain relevant, the relative importance of seasonal and short-term features increases with the prediction period.

5.2. Sick Leave Prediction

5.2.1. Short-Term Predictive Performance (Single-Target Prediction)

To assess the predictive accuracy of sick leave one week in advance based on our data representation, we compared the performance of the baseline model with regression trees and various ensemble methods using four evaluation metrics. The results, presented in Figure 9, show a clear distinction in performance between models (detailed results are available in Table S6 of the Supplementary Materials).
The baseline model, as expected, produced the poorest results across all metrics, emphasizing the need for more advanced predictive approaches. The regression tree method shows a significant improvement over the baseline.
Among the ensemble methods, Random Forests achieved the best results, showing the lowest RMSE and highest correlation coefficient. Bagging also performed well, with only slightly higher RMSE. BagSubspaces and Random Subspaces demonstrated moderate improvements, performing better than the baseline but not outperforming the other ensemble methods.

5.2.2. Medium-Term Predictive Performance (Multi-Target Prediction)

To evaluate the accuracy of predicting sick leave absences over a two-week period, we employed multi-target versions of various tree-based algorithms to build models capable of predicting absences for both weeks. The results, presented in Figure 10, reveal notable variations in predictive performance across methods (detailed results are available in Table S7 in the Supplementary Materials).
Overall, Bagging and Random Forests stand out as the best-performing methods, each achieving strong results with minimal differences between them. Bagging achieved a slightly lower MAE, while Random Forests had a marginally higher correlation coefficient and a lower RMSE.
In particular, the figure also shows that the predictive accuracy is better for the first week than for the second, as evidenced by the lower MAE, MSE, and RMSE for W1 compared to W2. This drop in accuracy for the second week illustrates the challenges associated with forecasting over longer horizons, emphasizing the importance of using advanced models like Bagging and Random Forests to capture the complexities of sick leave patterns.

5.2.3. Long-Term Predictive Performance (Multi-Target Prediction)

To evaluate how well we can predict sick leave one month in advance based on our data representation, we compared the performance of different models, including the baseline, regression trees, and various ensemble methods, using four evaluation metrics. The results, presented in Figure 11, provide insight into the effectiveness of these approaches (detailed results are available in Table S8 of the Supplementary Materials).
Among the methods evaluated, Bagging stands out as the best performing method, achieving the lowest average MAE, MSE, and RMSE, while also having a high correlation coefficient (CC). Random Forests, while slightly behind Bagging, also demonstrate comparable performance.
Additionally, it is evident from the figure that predictions for the first week are more accurate than for subsequent weeks, as reflected in the lower RMSE for W1 compared to later weeks. This decline in predictive accuracy as the forecast period extends illustrates the challenges of long-term forecasting.

5.2.4. Comparison of Prediction Approaches

To predict sick leave one month ahead, we can either use single-target models to predict sick leave for each week separately or build a multi-target model to predict all weeks together. To determine the better strategy, we compared the performance of both approaches.
Figure 12 presents the RMSE measure results for all targets, comparing single-target and multi-target models (detailed results are available in Table S9 in the Supplementary Materials). Across the board, Bagging emerges as the best-performing method with the lowest errors. The results also demonstrate that multi-target models generally provide slightly better performance across methods, reflecting their ability to capture correlations between targets.
The results also reveal a pattern where predictions for week W1 have the least errors, while errors progressively increase in subsequent weeks. This finding suggests that the models are best suited for short-term forecasting.
Figure 13 presents the results for the CC measure. Again, Bagging demonstrates the highest performance overall, with the highest correlation coefficients (detailed results are available in Table S10 in the Supplementary Materials). The figure also indicates that multi-target models tend to provide slightly better correlation coefficients than single-target models.
Similarly to the RMSE results, the CC values are highest for W1 and then progressively decrease with each subsequent week, underscoring the difficulty of accurately forecasting over longer time frames. These results highlight that while Bagging remains the best prediction method for this task, all models perform better for short-term predictions compared to longer horizons.

5.2.5. Single-Model vs. Ensemble Sick Leave Prediction

The results (see Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13) across all forecasting periods—one-week ahead, two-weeks ahead, and one-month ahead—clearly show that ensemble methods outperform the single model (regression tree). This trend is consistent for both sick leave and vacation leave predictions. In the one-week ahead task, Bagging and Random Forest (RForest) yield the best results for both sick and vacation leave, with Bagging generally achieving the lowest errors. For the two-week ahead task, RForest performs best for sick leave, while Bagging leads in predicting vacation leave. In the one-month ahead predictions, Bagging stands out as the best method for both sick leave and vacation leave prediction, consistently achieving the lowest errors and highest correlation coefficients. This underscores the superior predictive performance of ensemble methods across different time frames.

5.2.6. Feature Ranking with Random Forests

In Table 9, we outline the most important attributes for sick leave using Random Forests across three different prediction horizons: one-week, two-weeks, and one month. The attributes are ordered by their importance based on impurity reduction.
Across all prediction horizons, past sick and vacation leave data remain the most important features to forecast future sick leave. The importance of 12-month historical data is evident, indicating that patterns and trends spanning over a year offer valuable insights into predicting future absence. As the prediction horizon extends from one week to one month, there is a slight shift towards considering more recent historical data (six months) alongside longer-term data (12 months). This suggests that both recent and long-term trends are essential for more accurate predictions over extended periods.
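The impurity-based ranking used here can be sketched as follows. The feature names and the synthetic target are illustrative only (the target deliberately leans on a "12-month" lagged feature so that the ranking is predictable); the study itself derives the importances from Random Forests of PCTs, not from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
names = ["sick_12m", "vacation_12m", "sick_6m", "holiday_flag", "tenure"]
X = rng.random((300, len(names)))
# synthetic target dominated by the 12-month sick-leave history
y = 3 * X[:, 0] + X[:, 2] + 0.1 * rng.standard_normal(300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# impurity-based importances, normalized to sum to 1 across all features
ranking = sorted(zip(names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
```

Sorting the per-feature impurity reductions in this way yields exactly the kind of ordered attribute list reported in Tables 8 and 9.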

6. Discussion

6.1. Summary of Research Question Findings

In Table 10, we present a summary of the findings for each research question. Our analysis revealed key insights into the performance of various predictive models across different time horizons for employee absenteeism (Q1–Q3). The ensemble methods, particularly Bagging and Random Forests, consistently outperformed single models like regression trees. This superiority was evident across one-week, two-week, and one-month prediction intervals, underscoring the robustness of ensemble approaches to capture complex patterns and variances in the data. Bagging, in particular, showed the most consistent performance, offering the lowest error rates and the highest correlation coefficients, thus providing a reliable model for forecasting employee absences.
The comparison between single-target and multi-target approaches (Q4–Q5) further highlighted that while single-target models are straightforward and easier to implement, multi-target models often yield slightly better performance, particularly in longer prediction windows. This suggests that capturing interdependencies between consecutive weeks can enhance predictive accuracy, a crucial insight for HR departments planning over longer periods.
Finally, the feature importance analysis (Q6) highlighted that time-related attributes and specific absence history metrics, such as past sick leave and vacation patterns, were crucial in improving the predictive performance of the models.
The practical implications of these findings are significant. The demonstrated efficacy of ensemble methods and multi-target models can guide HR analytics strategies, encouraging the adoption of more complex, but more accurate, predictive tools. These tools can be instrumental in helping HR professionals anticipate and plan for potential staffing gaps, thereby ensuring that operational efficiency is maintained.
Furthermore, the varying performance across different time horizons raises important questions about the optimal frequency of model updates and the potential need for different modelling approaches depending on the specific forecasting needs of the organization. As we move towards integrating these models into HR decision-making processes, understanding their performance nuances becomes critical in tailoring interventions that are both proactive and contextually appropriate.

6.2. Model Scope and Specificity

A critical decision in predictive modeling for employee absenteeism is choosing between a global model (such as the models described in this paper), which integrates data across various companies, regions, and industries, as opposed to more localized models tailored to specific domains or regions. Global models benefit from larger datasets, potentially improving the model’s generalizability and robustness. However, they may overlook nuanced patterns specific to certain industries or regions, such as cultural differences in work habits or regional legal differences that affect absenteeism.
Conversely, localized models can capture these nuances by focusing on data from specific sectors or regions. They might reveal unique predictors of absence that are not evident in a broader dataset. For example, factors that influence absenteeism in a tech company can differ significantly from those of a manufacturing plant due to varying work environments and employee demographics. Similarly, regional models might adapt better to local economic conditions, public holidays, or weather patterns that influence absence rates.
Exploring the transferability of models between sectors or regions is another potential area of research. By testing how models trained in one sector perform when applied to another, or how models developed using data from one region predict absences in a different region, researchers can assess the adaptability and flexibility of their predictive approaches. Such experiments can help identify core, universal predictors of absenteeism across domains and highlight specific factors that are regionally or sector-specific.
Finally, building models on data from a single company brings some challenges. These include the risk of overfitting to the particularities of the company’s data and a limited ability to generalize to new employees or underrepresented groups within the company.

6.3. Ethical Issues

When utilizing machine learning models to predict employee absenteeism, it is crucial to consider the ethical implications, particularly the risk of introducing bias into the learning process. Ensuring fairness and avoiding discrimination is paramount, as biased models could disproportionately affect certain groups based on gender, age, race, or other characteristics, leading to unfair workplace practices. To mitigate these risks, data used for training models must be carefully curated to accurately represent diverse employee populations. Additionally, transparency in how models are constructed and employed is necessary to maintain trust and accountability. Regular audits and updates of these models should be conducted to assess and rectify any emergent biases or inaccuracies. By prioritizing ethical considerations in the development and deployment of predictive models, organizations not only comply with legal standards but also uphold a commitment to fairness and equity in their HR practices.

6.4. Model Updates and Maintenance

As with any predictive model, ongoing maintenance and improvement are key to maintaining performance over time. It is essential to engage in discussions about the frequency of retraining or updating the models as new data become available. Determining the optimal retraining frequency involves balancing between responsiveness to new data and computational efficiency. For some sectors, quarterly updates might suffice, while in fast-changing sectors, monthly or even weekly updates may be necessary. This schedule should be tailored based on the rate of change in the key variables that influence employee absence rates.
For environments with continuously streaming data, incremental learning models are ideal. These models update continuously as new data arrive, rather than requiring batch processing of large datasets. This approach not only saves on computational resources but also keeps the model up-to-date in real time, enhancing its predictive accuracy.
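The streaming pattern can be sketched as follows. Note that the tree ensembles evaluated in this study do not support incremental updates out of the box, so a linear learner with `partial_fit` stands in here purely to illustrate the weekly-update loop on synthetic data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(random_state=0)

for week in range(52):                       # one batch of instances per week
    X_batch = rng.random((50, 10))           # synthetic weekly feature batch
    y_batch = X_batch @ np.arange(10) + 0.1 * rng.standard_normal(50)
    model.partial_fit(X_batch, y_batch)      # update without full retraining
```

Each `partial_fit` call folds the newest week of data into the model, so predictions always reflect the most recent attendance behaviour without reprocessing the full history.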
Predictive models should also be reviewed in light of changes in company policies, labour laws, and economic conditions. For instance, changes in work-from-home policies or amendments in employment laws should trigger a reassessment of the model to ensure its continued applicability.
Finally, developing a model evolution strategy is advisable, where a pipeline of model development, from initial training through various iterations of updates and improvements, is clearly defined. This strategic approach ensures a structured evolution of the model, maintaining its alignment with organizational goals and external conditions.

7. Conclusions and Future Work

Our study emphasizes the effectiveness of ensemble methods, especially bagging, in predicting employee absenteeism across different prediction intervals. For one-week-ahead predictions, bagging proves the most accurate for vacation leave, while Random Forest performs best for sick leave. For two-week-ahead predictions, the same methods again achieve the best results, with accuracy highest for the first predicted week. Even for one-month-ahead predictions, bagging continues to show superior performance for vacation leave, and Random Forest remains effective for sick leave. The comparative analysis of single-target and multi-target predictions underlines the robustness of bagging across different prediction tasks and time periods.
Vacation leave predictions tend to be more reliable and straightforward, owing to the regularity and predictability of the historical data, which often include planned absences. The clarity of these patterns benefits methods such as bagging, making it particularly well suited to modeling vacation leave.
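The bagging principle behind these results can be sketched in a few lines. This is a toy illustration, not the CLUS-based implementation used in the study: each base model is fit on a bootstrap resample of the training targets, and the ensemble prediction is the average of the base predictions (here the base learner is deliberately trivial, a sample mean).

```python
import random

# Toy sketch of bagging (Breiman, 1996): fit each base model on a
# bootstrap resample and average the predictions.

def fit_base_learner(sample):
    return sum(sample) / len(sample)               # predicts a constant

def bagging_predict(train_y, n_estimators=100, seed=42):
    rng = random.Random(seed)
    preds = []
    for _ in range(n_estimators):
        boot = [rng.choice(train_y) for _ in train_y]   # bootstrap resample
        preds.append(fit_base_learner(boot))
    return sum(preds) / len(preds)                 # average the base predictions

vacation_days = [0, 1, 1, 2, 5, 0, 3, 1]           # hypothetical weekly targets
print(round(bagging_predict(vacation_days), 2))
```

With unstable base learners such as regression trees, this averaging over resamples is what reduces variance and yields the robustness observed in our experiments.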
In future research, we will explore different scenarios for predictive modeling with the prepared datasets. We plan to investigate selective predictive analysis of absenteeism by identifying critical weeks influenced by factors such as seasonality, proximity to public holidays, and job-specific requirements. In addition, regional and domain-specific analyses will examine how absence predictions differ by location and type of organization. More detailed time-based comparisons, from daily to monthly predictions, will also be carried out using hierarchical regression and multi-label classification techniques to predict detailed weekly employee profiles.
To adapt to changes in working patterns after the COVID-19 pandemic, including the increase in remote working, we plan to integrate post-pandemic data into our models. In addition, a real-time analytics tool will be developed that can be integrated into HR systems to enable proactive identification and management of irregular absence patterns. Improving model interpretability and creating user-friendly dashboards are also crucial for promoting trust and improving the practical use of such a tool to support decision-making in the organisational environment.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app14167037/s1. All detailed experimental results (presented in table format) referenced in the text of the manuscript are available in the Supplementary Materials, which contain the following tables: Table S1: Results for one-week-ahead vacation leave prediction; Table S2: Results for two-week-ahead vacation leave prediction; Table S3: Results for one-month-ahead vacation leave prediction; Table S4: Results for the RMSE measure comparing single-target and multi-target models for vacation leave prediction; Table S5: Results for the CC measure comparing single-target and multi-target models for vacation leave prediction; Table S6: Results for one-week-ahead sick leave prediction; Table S7: Results for two-week-ahead sick leave prediction; Table S8: Results for one-month-ahead sick leave prediction; Table S9: Results for the RMSE measure comparing single-target and multi-target models for sick leave prediction; Table S10: Results for the CC measure comparing single-target and multi-target models for sick leave prediction.

Author Contributions

Conceptualization, P.Z. and P.P.; methodology, P.P.; software, P.Z.; validation, P.Z. and P.P.; investigation, P.P.; resources, P.Z.; data curation, P.Z.; writing—original draft preparation, P.Z.; writing—review and editing, P.P.; visualization, P.Z. and P.P.; supervision, P.P. All authors have read and agreed to the published version of the manuscript.

Funding

Peter Zupančič is supported by the company 1A Internet, d.o.o. and Panče Panov is supported by the Slovenian Research and Innovation Agency through program group P2-0103.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are not publicly available due to privacy concerns and confidentiality agreements. These restrictions are in place to comply with legal obligations and to safeguard proprietary interests. Details about the data and the conditions for access can be obtained from the corresponding author upon reasonable request.

Acknowledgments

We want to thank Dragi Kocev for the discussion and the advice regarding data representation. We express our gratitude to 1A Internet d.o.o. for providing access to the data utilized in our research.

Conflicts of Interest

Author Peter Zupančič was employed by the company 1A Internet, d.o.o. The authors declare no conflict of interest.

Figure 1. The distribution of different company types.
Figure 2. (a) Regional distribution (b) Statistical regions in Slovenia.
Figure 3. Sliding-window technique: (a) one week forecast; (b) two weeks forecast; (c) four weeks ahead forecast.
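The sliding-window construction of Figure 3 can be sketched as follows (illustrative only; the window sizes, data, and `sliding_windows` helper are hypothetical): each window position yields one learning instance, with the trailing weeks as descriptive attributes and the following one, two, or four weeks as target(s).

```python
# Sketch of the sliding-window technique from Figure 3 (hypothetical data):
# history_size trailing weeks form the descriptive part, the next `horizon`
# weeks form the target(s); the window then slides forward one week.

def sliding_windows(weekly_absences, history_size, horizon):
    instances = []
    for k in range(history_size, len(weekly_absences) - horizon + 1):
        features = weekly_absences[k - history_size:k]
        targets = weekly_absences[k:k + horizon]     # 1, 2 or 4 weeks ahead
        instances.append((features, targets))
    return instances

weeks = [0, 1, 0, 2, 5, 0]                           # absence days per week
pairs = sliding_windows(weeks, history_size=3, horizon=2)
print(pairs)
```

Setting `horizon` to 1, 2, or 4 reproduces panels (a), (b), and (c) of the figure, respectively.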
Figure 4. Results for one-week-ahead vacation leave prediction for all four evaluation metrics: MAE, MSE, RMSE and CC.
Figure 5. Results for two-week-ahead vacation leave prediction for all four evaluation metrics. The evaluation metrics are shown separately for each week (W1 and W2), with the “Average” line representing the mean value of the metric across both weeks.
Figure 6. Results for one month ahead vacation leave prediction for all four evaluation metrics. The evaluation metrics are shown separately for each week (W1, W2, W3 and W4), with the “Average” line representing the mean value across all weeks.
Figure 7. Results for the RMSE metrics comparing single-target and multi-target models for vacation leave prediction. The comparison results are presented for each week separately. The ST and MT average lines show the averages of the RMSE error for all four weeks in the single-target and multi-target setting respectively.
Figure 8. Results for the CC metrics comparing single-target and multi-target models for vacation leave prediction. The comparison results are presented for each week separately. The ST and MT average lines show the averages of the CC metrics for all four weeks in the single-target and multi-target setting respectively.
Figure 9. Results for one-week-ahead sick leave prediction for all four evaluation metrics: MAE, MSE, RMSE and CC.
Figure 10. Results for two-week-ahead sick leave prediction for all four evaluation metrics. The evaluation metrics are shown separately for each week (W1 and W2), with the “Average” line representing the mean value of the metric across both weeks.
Figure 11. Results for one month ahead sick leave prediction for all four evaluation metrics. The evaluation metrics are shown separately for each week (W1, W2, W3 and W4), with the “Average” line representing the mean value across all weeks.
Figure 12. Results for the RMSE metrics comparing single-target and multi-target models for sick leave prediction. The comparison results are presented for each week separately. The ST and MT average lines show the averages of the RMSE error for all four weeks in the single-target and multi-target setting respectively.
Figure 13. Results for the CC metrics comparing single-target and multi-target models for sick leave prediction. The comparison results are presented for each week separately. The ST and MT average lines show the averages of the CC metric for all four weeks in the single-target and multi-target setting respectively.
Table 1. The structure of the data instances used for the learning task. (a) Single-target regression; (b) Multi-target regression (example for 4 week ahead prediction).
(a)
Descriptive Attributes | Target Attribute
Demographic profile | Absence profile (up to week K) | Absence in W_K (number of days)
(b)
Descriptive Attributes | Target Attributes
Demographic profile | Absence profile (up to week K) | Absence in W_K | Absence in W_K+1 | Absence in W_K+2 | Absence in W_K+3
Table 2. The structure of demographic profile attributes.
Attribute Name | Type | Description
CompanyType | nominal | Company type by specific categories. We have defined 33 different company types.
Region | nominal | The region in which the employee’s company is located. We have defined 12 different regions.
EmploymentYears | numeric | How many years the person has been employed by the current company.
WorkHour | numeric | How many hours per day an employee is employed by contract.
JobType | nominal | Type of job (e.g., permanent, part-time). We have defined 6 different job types.
Table 3. Overview of statistical lagged attributes. The notation ‘XY’ specifies the time period over which these attributes are calculated, including 1 week, 2 weeks, 1 month, 3 months, 6 months, and 12 months. The ‘Z’ notation indicates the specific day of the week for which the attributes are aggregated, where ‘1’ (Monday) represents the first day of the week, up to ‘7’ (Sunday).
Attribute Name | Type | Description
PreviousXYVacationLeave | numeric | Number of days an employee was on vacation leave in the defined period.
PreviousXYSickLeave | numeric | Number of days an employee was on sick leave in the defined period.
PreviousXYVacationLeaveShortTerm | numeric | Number of days an employee was on short-term vacation leave (at most three days) in the defined period.
PreviousXYVacationLeaveLongTerm | numeric | Number of days an employee was on long-term vacation leave (more than three days) in the defined period.
PreviousXYSickLeaveShortTerm | numeric | Number of days an employee was on short-term sick leave (at most three days) in the defined period.
PreviousXYSickLeaveLongTerm | numeric | Number of days an employee was on long-term sick leave (more than three days) in the defined period.
PreviousXYVacationLeaveDayZ | numeric | Number of days an employee was on vacation leave on a specific day of the week in the defined period.
PreviousXYSickLeaveDayZ | numeric | Number of days an employee was on sick leave on a specific day of the week in the defined period.
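Lagged attributes of this kind can be derived from a daily absence log as simple windowed counts. The following is an illustrative reconstruction (the log data and the `lagged_count` helper are our own; attribute names follow the table’s XY notation), not the study’s actual feature-engineering code:

```python
from datetime import date, timedelta

# Illustrative reconstruction of lagged-attribute computation (cf. Table 3):
# count absence days of a given leave type within a trailing window that
# ends at the reference date.

def lagged_count(absence_log, leave_type, reference, window_days):
    start = reference - timedelta(days=window_days)
    return sum(1 for day, kind in absence_log
               if kind == leave_type and start <= day < reference)

log = [                                    # (day, leave type) entries
    (date(2024, 3, 4), "sick"),
    (date(2024, 3, 5), "sick"),
    (date(2024, 2, 1), "vacation"),
]
ref = date(2024, 3, 11)
previous1week_sick = lagged_count(log, "sick", ref, 7)       # ~Previous1WeekSickLeave
previous3month_vacation = lagged_count(log, "vacation", ref, 90)
print(previous1week_sick, previous3month_vacation)
```

Day-of-week variants (the DayZ attributes) add one more filter on `day.isoweekday()` inside the same windowed count.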
Table 4. The structure of the seasonal lagged attributes, where X represents the week number.
Attribute Name | Type | Description
WeekPublicHolidayK+X | numeric | Count of absence days on public holidays.
WeekWinterHolidayK+X | numeric | Count of absence days during winter holidays.
WeekSpringHolidayK+X | numeric | Count of absence days during spring holidays.
WeekSummerHolidayK+X | numeric | Count of absence days during summer holidays.
WeekAutumnHolidayK+X | numeric | Count of absence days during autumn holidays.
WeekWinterSeasonK+X | numeric | Count of absence days in the winter season.
WeekSpringSeasonK+X | numeric | Count of absence days in the spring season.
WeekSummerSeasonK+X | numeric | Count of absence days in the summer season.
WeekAutumnSeasonK+X | numeric | Count of absence days in the autumn season.
Table 5. The structure of the target attributes, where X represents the week number.
Attribute Name | Type | Description
WeekVacationLeaveK+X | numeric | Target attribute for predicting vacation leave for the defined week.
WeekSickLeaveK+X | numeric | Target attribute for predicting sick leave for the defined week.
Table 6. Summary of research questions.
No. | Topic | Question | Description
Q1 | Short-term predictive performance | How well can we predict employee absences for the upcoming week using data up to one week prior, using a single-target prediction model? | Evaluates the effectiveness of single-target models in predicting immediate future absences, crucial for short-term staffing decisions.
Q2 | Medium-term predictive performance | What is the predictive performance of models for employee absences two weeks in advance using multi-target prediction? | Assesses the utility of multi-target predictions for medium-term resource allocation and planning, capturing dependencies between consecutive weeks.
Q3 | Long-term predictive performance | How effectively can our multi-target prediction models predict absences one month ahead (represented as a four-week-ahead learning problem)? | Aids in strategic planning by determining how well multi-target approaches predict extended absence patterns.
Q4 | Comparison of prediction approaches | Which yields better predictive performance for one-month-ahead predictions: single-target or multi-target prediction models? | Compares single-target and multi-target models to identify the optimal approach for long-term prediction.
Q5 | Model comparison across time frames | Does the use of ensemble methods improve predictive performance consistently across all three forecasting periods compared to a baseline single model? | Examines whether ensemble methods, which combine multiple model predictions, offer superior accuracy and robustness compared to single models.
Q6 | Influence of predictive attributes | Which attributes exert the most significant influence on predicting employee absences one week, two weeks, and one month ahead, for both vacation leave and sick leave prediction? | Identifies key attributes that significantly impact prediction accuracy, informing targeted data collection and feature engineering.
Table 7. Summary of evaluation metrics used in predicting employee absences.
Metric | General Description | Context-Specific Description
Mean Absolute Error (MAE) | Measures the average magnitude of errors between predictions and actual outcomes, without considering the direction of these errors. | Particularly valuable in our context as it quantifies the average error in the predicted number of absence days compared to actual days absent.
Mean Squared Error (MSE) | Penalizes larger errors more heavily by squaring the error terms, which emphasizes larger deviations more than smaller ones. | Useful in our study because larger prediction errors can significantly affect organizational planning and resource allocation.
Root Mean Squared Error (RMSE) | Provides a scale-sensitive measure of error magnitude, expressed in the same units as the predicted outcome. | Critical for comparing the performance of our models against a practical threshold of acceptable error, aiding in decision-making for model deployment.
Pearson Correlation Coefficient (CC) | Measures the linear correlation between the predicted and actual values, indicating the degree to which these values co-vary. | A high CC in our analysis suggests that the model predictions align well with actual employee absences, reflecting accurate trend capture over time.
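The four metrics above can be computed directly from paired predictions and actuals. A minimal sketch with toy data (the `evaluate` helper and the example values are illustrative):

```python
import math

# Minimal sketch computing the four evaluation metrics of Table 7
# for a toy set of predicted vs. actual weekly absence days.

def evaluate(y_true, y_pred):
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    cc = cov / (st * sp)               # assumes non-constant inputs
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "CC": cc}

actual    = [0, 1, 2, 5, 0]
predicted = [0, 1, 3, 4, 1]
metrics = evaluate(actual, predicted)
print({k: round(v, 3) for k, v in metrics.items()})
```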
Table 8. Attribute importance using Random Forests for vacation leave prediction with forecasting horizons of one week, two weeks, and one month. Attributes are presented in descending order of importance, as determined by the impurity function. Only the top ten attributes are shown.
One-Week Ahead Prediction | Two-Week Ahead Prediction | One-Month Ahead Prediction
Previous12MonthSickLeaveDay5 | Previous1MonthVacationLeaveDay6 | Week_SpringSeason_K+2
Previous12MonthVacationLeaveDay5 | Week_WinterSeason_K+1 | Week_WinterHoliday_K+3
Previous12MonthVacationLeaveDay4 | Previous2WeekSickLeaveDay6 | Week_WinterSeason_K+4
Previous12MonthSickLeaveDay4 | Previous12MonthVacationLeaveDay4 | Week_WinterSeason_K+3
Previous12MonthVacationLeaveDay3 | Previous12MonthSickLeaveDay2 | Week_SpringSeason_K+3
Previous6MonthVacationLeaveDay4 | Week_WinterSeason_K+2 | Week_SpringSeason_K+1
Previous12MonthVacationLeaveDay1 | Previous12MonthSickLeaveDay5 | Week_SpringSeason_K+4
Week_WinterSeason_K+1 | Previous3MonthSickLeaveDay5 | Week_WinterHoliday_K+2
Previous3MonthVacationLeaveDay5 | Previous6MonthVacationLeaveDay5 | Week_WinterSeason_K+2
Previous12MonthSickLeaveDay1 | Week_AutumnSeason_K+2 | Week_SpringHoliday_K+4
Table 9. Attribute importance using Random Forests for sick leave prediction with forecasting horizons of one week, two weeks, and one month. Attributes are presented in descending order of importance, as determined by the impurity function. Only the top ten attributes are shown.
One-Week Ahead Prediction | Two-Week Ahead Prediction | One-Month Ahead Prediction
Previous12MonthSickLeaveDay5 | Previous6MonthSickLeaveDay5 | Previous12MonthSickLeaveDay5
Previous12MonthVacationLeaveDay4 | Previous12MonthVacationLeaveDay5 | Previous12MonthVacationLeaveDay5
Previous3MonthVacationLeaveDay5 | Previous6MonthSickLeaveDay4 | Previous12MonthSickLeaveDay4
Previous6MonthVacationLeaveDay5 | Previous12MonthSickLeaveDay4 | Previous12MonthVacationLeaveDay4
Previous1MonthSickLeaveDay6 | Previous2WeekSickLeaveDay6 | Previous6MonthVacationLeaveDay4
Previous3MonthSickLeaveDay5 | Previous3MonthVacationLeaveDay5 | Previous12MonthVacationLeaveDay2
Previous12MonthVacationLeaveDay1 | Previous12MonthVacationLeaveDay2 | Week_SummerSeason_K+1
Previous12MonthSickLeaveDay3 | Previous12MonthVacationLeaveDay3 | Week_SummerHoliday_K+4
Previous6MonthSickLeaveDay4 | Previous6MonthVacationLeaveDay4 | Week_SpringSeason_K+1
Previous6MonthVacationLeaveDay4 | Previous3MonthSickLeaveDay3 | Week_SummerHoliday_K+3
Table 10. Summary of research question findings.
Research Question | Findings
Q1: How well can we predict employee absence one week ahead using historical data? | Bagging performed best for vacation leave, while Random Forest was most effective for sick leave, demonstrating strong predictive accuracy in both cases.
Q2: How effective are the predictive models at forecasting employee absence two weeks ahead? | The methods identified as best for each type of leave, bagging and Random Forest, again outperformed the other models, with accuracy highest for the first predicted week.
Q3: What is the predictive accuracy for one-month-ahead employee absence predictions? | Bagging continued to show superior results for vacation leave predictions, and Random Forest excelled in sick leave predictions, even at this extended forecast interval.
Q4: How do single-target and multi-target prediction models compare in predicting employee absence? | Bagging consistently demonstrated high performance in both single-target and multi-target prediction scenarios, affirming its robustness across prediction tasks.
Q5: How effective are ensemble methods compared to single models in predicting employee absence across different time frames? | Ensemble methods, particularly bagging and Random Forest, consistently outperformed single predictive models, highlighting their greater robustness against overfitting and enhanced predictive accuracy.
Q6: Which attributes are most influential in predicting absence one week, two weeks, and one month ahead? | Feature importance analysis revealed that time-related attributes and specific absence history metrics (such as past sick leave and vacation patterns) were critical to the models’ predictive performance.
