1. Introduction
Construction projects are well-known for their risk and unpredictability [
1]. This is especially true during the project’s early stages, such as the feasibility and planning phases, as the amount of available information is severely limited. Construction projects continuously generate vast quantities of data throughout the different stages, including design data, schedules, financial data, enterprise resource planning (ERP) systems, and other information, as noted by Bilal et al. [
2]. Knowledge management in the construction industry involves an integrated approach that combines information technology with social techniques [
3]. Access to and management of the right data at the right time play a crucial role in determining the success of a project. Lee et al. [
4] indicated that managing construction project-related knowledge can be challenging as it is typically dispersed among various stakeholders, including owners, supply chains, organizations, employees, and customers, making it difficult to conduct comprehensive data analysis.
Effective management of project factors is crucial for successful construction projects [
5]. In today’s construction industry, construction professionals rely on scheduling and progress monitoring to ensure the timely completion of projects [
6]. However, during the early stages of the project, the amount of detailed available data related to the project is limited. This makes preparing accurate schedules and cost estimates nearly impossible and adversely affects resource management, production rates, execution strategies, and overall decision-making. Collecting large amounts of high-quality data that fully represent the real construction world has significant limitations [
7]. Currently, the significant challenges facing machine learning (ML) are insufficient training data, unstructured data, low-quality data, and training data with irrelevant features, which lead to underfitting and overfitting [
8,
9,
10]. Similarly, ref. [
11] specifically highlighted the ineffective utilization of machine learning algorithms that results from data limitations and information constraints. Various studies have used metaheuristic algorithms such as genetic algorithms (GA) to solve scheduling problems that arise from limited data, and some have developed complex models to address real requirements, such as considering multiple projects [
12], changing resource constraints [
13], and simultaneously considering resource-constrained and time–cost trade-off problems [
14]. However, Wang et al. [
15] identify the difficulty of data collection as a significant flaw in these studies; therefore, it is difficult to apply these methods to practical projects. Accordingly, ML and heuristic models should consider two major issues: the first is a structured data source to provide the time, cost, resources, greenhouse gas (GHG) emissions, etc., for project planning; the second is a suitable transfer method to move the data from the data source to the mathematical model such as BIM.
Pan and Zhang [
16] highlighted that, as a field of computer science, AI enables computers to perceive and learn input much like humans do. It encompasses knowledge representation, reasoning, problem-solving, and planning to address complex and ambiguous problems in a deliberate, intelligent, and adaptive manner. Investment in AI is experiencing rapid growth, with machine learning playing a significant role in exploring robust data from multiple sources and leveraging insights for informed decision-making. They also discussed the current gap in the adoption of AI techniques within construction projects, as it still lags behind other industries despite the exponential increase in generated data. Consequently, there is substantial interest in implementing various AI methods in the AEC industry to utilize the valuable opportunity presented by digital evolution for improved performance and profitability. One of the primary applications of artificial intelligence lies in the domain of “From Data to Decision” (FD2D). This term focuses on digital transformation and artificial intelligence from the data value chain to human value [
17]. It involves developing and studying digital transformation and artificial intelligence applications across various academic disciplines and industrial sectors, and their impact on society. FD2D is a framework designed to convert diverse raw data into actionable insights using organized workflows. In construction, FD2D consists of three key steps: First is data acquisition, which standardizes and integrates scattered project information such as schedules, budgets, resources, and embodied carbon emissions. Second, knowledge extraction involves using analytics tools like AI and machine learning to identify patterns. Lastly, decision support translates these insights into actionable operational strategies, focusing on areas such as risk mitigation and resource optimization. The proposed Integrated Planning and Control Framework (IPCF) operationalizes FD2D’s first step by tackling the challenge of data fragmentation in construction. Specifically, IPCF’s Construction Data Hub (CDH) standardizes data collection across 12 dimensions (
Section 4), facilitating downstream FD2D processes such as predictive modeling (
Section 6). Therefore, IPCF is fundamental to FD2D’s objective of data-driven construction management.
Building Information Modeling (BIM) is a significant advancement in architecture, engineering, and construction (AEC). BIM technology creates accurate digital models of buildings, enhancing design analysis and control over manual processes. These models include precise geometry and data to support construction, fabrication, and procurement [
18]. BIM also models a building’s lifecycle, enabling new design and construction capabilities and evolving project team roles. When effectively implemented, BIM leads to more integrated design and construction, resulting in higher quality buildings, reduced costs, and shorter project durations. BIM has emerged as a crucial tool for structuring and sharing project data among stakeholders, particularly in design and coordination [
16]. However, BIM’s focus on geometric and procedural data often neglects essential dimensions such as labor productivity, environmental impact, and social metrics, which remain scattered across disparate systems. The proposed CDH enhances BIM’s capabilities by integrating these non-geometric dimensions into a unified framework, enabling comprehensive analysis. For instance, while BIM may track material quantities, the CDH associates this data with cost, carbon emissions, and supplier histories, bridging gaps in early-stage decision-making.
This research focuses on resolving the first issue found in the previous models: collecting and structuring the data model. It employs the concept of FD2D in the construction industry to develop a data-driven decision-making tool called Integrated Planning and Control Framework (IPCF), by proposing a methodology (
Figure 1) that transforms the significant volume of scattered construction data in various formats into valuable decisions. First, comprehensive data and appropriate metadata are collected from projects during the different construction project stages. Second, these data points are linked to one or more construction project phases, and a data acquisition model called the Construction Data Hub (CDH) is developed to collect data from multiple projects and organizations in a format ready for applying machine learning and artificial intelligence techniques. The CDH uses mapping tables to standardize terminology such as management levels, project phases, and categories. Lastly, the collected data are used as input to knowledge-based decision-making (KBDM) to derive valuable decisions (discovered knowledge) that improve future projects. This last step provides the feedback loop for continuous improvement.
4. Dimensions Impacting Project Planning and Control
To develop an inclusive framework for the data acquisition model, the authors sought to identify the dimensions that impact the planning and control of a project. In 2021, Taghinezhad et al. [
45] performed a comprehensive literature review to identify the project management dimensions related to the successful delivery of transportation projects. They identified twelve dimensions: time management, cost management, quality control and inspection, environmental process, right of way and utilities, safety, outsourcing, value engineering, change orders, type of contract, workforce qualification, and operation and maintenance. This research employs the dimensions identified by Taghinezhad et al. [
45] and performs further analysis of the applicability of these dimensions in construction projects in general and the availability of data related to these dimensions.
Based on the authors’ analysis of the literature and practical experience as project managers, this study examines twelve dimensions that need proper identification and standardization: scope management, planning and scheduling, cost estimating, cost control, labor resources, materials management, equipment management, progress measurement and performance evaluation, document management, risk management, environmental impact, and social impact. Two dimensions (quality and safety) were excluded from the framework as these dimensions are managed using different systems. After consulting industry experts and professionals on construction projects, who concurred with this list of dimensions, the authors developed the data acquisition model, suggesting some improvements for the current practices and identifying relevant metadata to be considered for each dimension while collecting the project’s data.
To summarize the selection criteria, a rigorous two-stage process guided the selection of the twelve dimensions for the CDH framework. First, a literature review of construction project management systems, such as the PMBOK and domain-specific studies [
20,
45], identified recurring critical factors across successful projects. Second, these dimensions were validated through consultations with industry experts. Accordingly, dimensions were included based on three criteria: (1) empirical evidence of influencing project success (e.g., cost/time performance), (2) alignment with emerging industry priorities (e.g., environmental/social dimensions), and (3) feasibility of standardized data collection (e.g., work package-level GHG metrics). This dual academic–practical approach ensures that the CDH addresses both the foundational and forward-looking needs of the AEC sector.
Structuring a comprehensive data acquisition model that includes all the above-mentioned dimensions required a significant analysis of all the stakeholders and entities involved in executing the project. Entities here start at a high level to include the owner, consultant, construction organization, and subcontractors, and end at the level of resource hours (i.e., labor, material, and equipment) required to execute the project. The CDH is based on the integration between the four main structures in any construction project: OBS, PBS, RBS, and WBS.
Figure 3 depicts this relationship, starting with an organization that has one or more divisions, each of which has multiple business units comprising various departments. Each department owns multiple types of resources, such as labor, material, and equipment. Each business unit is accountable for managing one or more portfolios to attain the strategic goals and objectives of the organization. A portfolio consists of multiple programs: a program is formed by a group of related projects, and the completion of the program’s goals relies on the successful execution of all these projects. A project has several phases, and each phase has disciplines, such as architecture, civil, structural, mechanical, electrical, etc. Each project is then decomposed into many work packages, and each work package is specific and large enough to describe the work executed to produce a product in a construction project. Accordingly, we can conclude that the key element here is the work package, as it connects the four structures. This highlights its value and the significant need for proper utilization of it as a knowledge carrier between the organization’s projects. The integration mechanics are explicitly illustrated later in
Section 5, which serves as the relational model for the system.
During the planning phase, information such as schedules, estimates, designs, drawings, and other documents is not detailed enough to prepare proper estimates or select the optimal construction execution method. The conventional work package includes only data related to the cost and time required to execute that package. To improve on the conventional work package, this research introduces the “multidimensional work package” (MDWP), which extends traditional WPs by integrating 12 dimensions (
Figure 4), including scope, resources (e.g., labor, material, and equipment), real-time progress measurement, GHG emissions, risks, etc.
Table 1 illustrates the benefits of MDWPs vs. Traditional WPs. Each MDWP uses weighted progress activities (WPAs) to dynamically calculate percentage completion based on predefined Rules of Credit (ROC). For instance, an MDWP for “Steel Beam Installation” might assign weights to sub-tasks (e.g., 30% for delivery, 50% for welding, 20% for inspection), enabling precise progress tracking tied to resource use and emissions (
Section 4.5). The MDWPs are entities; hence, each MDWP has attributes that describe its particular properties. In this research, the attributes are unit of measure (UoM), labor hours per unit, equipment hours per unit, material per unit, GHG emissions (kg of CO2 equivalent per unit), and cost per unit. These attributes are crucial as they structure the main properties of each MDWP and have their equivalent key quantities.
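The per-unit attributes listed above can be sketched as a small record type. This is only an illustration, assuming that a work package's totals are obtained by scaling each per-unit attribute by its key quantity; the class name, field names, and all figures are hypothetical and are not taken from the CDH implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDWP:
    """Per-unit attributes of one multidimensional work package (illustrative)."""
    name: str
    uom: str                         # unit of measure of the key quantity
    labor_hours_per_unit: float
    equipment_hours_per_unit: float
    material_per_unit: float
    ghg_kg_co2e_per_unit: float
    cost_per_unit: float

    def totals(self, key_quantity: float) -> dict:
        """Scale each per-unit attribute by the key quantity (assumed relation)."""
        if key_quantity < 0:
            raise ValueError("key quantity must be non-negative")
        return {
            "labor_hours": self.labor_hours_per_unit * key_quantity,
            "equipment_hours": self.equipment_hours_per_unit * key_quantity,
            "material": self.material_per_unit * key_quantity,
            "ghg_kg_co2e": self.ghg_kg_co2e_per_unit * key_quantity,
            "cost": self.cost_per_unit * key_quantity,
        }


# Hypothetical figures, for illustration only.
beam = MDWP("Steel Beam Installation", "tonne", 12.0, 3.0, 1.25, 1800.0, 4200.0)
beam_totals = beam.totals(key_quantity=10)
```

Keeping the attributes per unit is what would let the same work package type be reused across projects with different key quantities.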
4.1. The 1st Dimension: Scope Management
Every dimension has its related data, which are usually stored in different systems or software. The first dimension, scope management, is the core element of the project management dimensions, as it requires a precise and clear definition of the project work and impacts other dimensions such as time, cost, and resources. Scope management in construction is the process of defining, documenting, and controlling the work required for a project (PMBOK Guide). It is critical for completing projects on time, within budget, and to the required quality standards. The scope is mainly represented through a WBS, which decomposes the full project into multiple work packages. These work packages utilize multiple resources at the same time (i.e., labor, material, and equipment) and are executed through major phases such as engineering, procurement, and construction. Other details included in this dimension are the key quantity name and unit of measure, baseline key quantity amount, actual key quantity amount, list of change orders, and the responsible resource for each work package.
4.2. The 2nd Dimension: Planning and Scheduling
Planning and scheduling is considered a significant dimension for the MDWP. In this research, the authors propose a generic template for WBSs by initiating two predefined libraries for product breakdown structure (PBS) and resources breakdown structure (RBS), to attain the standardization concept and improve the quality of the schedules and WBS developed by planners.
Generic Template for Work Breakdown Structure (WBS)
Starting with the generic template for the WBS, the structure of the template is based on the two predefined libraries: product breakdown structures (PBS) and resources breakdown structures (RBS). The PBS library contains the full scope of construction projects represented in the form of smaller, manageable work items as predefined work package types. This library is built in CDH by initiating all the work package types as activity codes. Those standardized packages can be utilized to build specific WBS for any future project while maintaining the standardization of levels, phases, subphases, and terms for all the organization’s projects. Similarly, another library is initiated for RBS to represent the predefined resources for labor, contractors, materials, and equipment. The template has a hierarchical structure and consists of four levels (
Figure 5). The first two levels are project-specific: level 1 contains the project description, while level 2 divides the project into areas (plants or units) and components. To increase data collection consistency, each area in level 2 is further divided into phases and sub-phases. These phases and sub-phases are necessary to categorize the work packages at level 3, and are pre-defined as follows:
- General;
- Planning/Scheduling;
- Cost Estimating;
- Cost Control.
- Preliminary;
- Conceptual;
- Detailed.
- Requisition and Awarding; purchase orders and subcontracts;
- Materials Management;
- Contract Administration.
- Manufacturing and Fabrication;
- Module Assembly;
- Site Installation.
Level 3 contains generic work package types that are extracted from the PBS library and present scope deliverables or product packages. Level 3 is the most significant level of the proposed template because it represents the whole scope of work divided into predefined work packages, such as columns, beams, foundations, structural steel, concrete, etc. However, the work packages here are not limited to the work executed during the construction phase. Some work packages may go under the engineering or procurement phase, while others, such as schedule and budget control baselines, may not go under any phase. Hence, to structure a consistent template, a new phase category called “General” is added to include all the work packages that do not represent work conducted in the engineering, procurement, or construction phases. The authors propose the term “general work packages” (GWP) to be consistent with similar terms used in the industry, such as “engineering work packages” (EWP), “procurement work packages” (PWP), and “construction work packages” (CWP). The final level of the WBS, level 4, contains progress activities for each work package.
To structure a project WBS from scratch using the suggested framework in
Figure 5, the planner should refer to the predefined libraries for PBS and RBS and assign all the requirements to work package types at level 3. The generic structure for WBS provides a comprehensive planning template that collects data consistently and meaningfully to facilitate further analysis and make sustainable decisions. Standardizing the terminologies with predefined levels, phases, and subphases enables the identification of meaningful metadata that can be used to capture data from multiple projects and organizations. Creating predefined work package types also enables meaningful data collection, and it facilitates data transfer.
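The four-level template can be sketched as a nested structure built from a predefined library. The sketch below is a minimal illustration under stated assumptions: the library entries, the pairing of sub-phases with phases, the progress activity names, and the function name `build_wbs` are all hypothetical and are not taken from the CDH itself.

```python
# Hypothetical PBS library: work package type -> phase, sub-phase, progress activities.
PBS_LIBRARY = {
    "Foundations": {
        "phase": "Construction",
        "sub_phase": "Site Installation",
        "activities": ["Excavation", "Formwork", "Concrete pour"],
    },
    "Structural Steel": {
        "phase": "Construction",
        "sub_phase": "Manufacturing and Fabrication",
        "activities": ["Fabrication", "Delivery", "Erection"],
    },
    "Schedule Baseline": {
        "phase": "General",
        "sub_phase": "Planning/Scheduling",
        "activities": ["Draft", "Review", "Approval"],
    },
}


def build_wbs(project: str, areas: dict) -> dict:
    """Build a four-level WBS; `areas` maps an area name to work package types."""
    wbs = {"level_1": project, "level_2": []}
    for area, wp_types in areas.items():
        packages = []
        for wp_type in wp_types:
            if wp_type not in PBS_LIBRARY:
                # Only predefined types are allowed, which keeps terminology standardized.
                raise KeyError(f"{wp_type!r} is not a predefined work package type")
            entry = PBS_LIBRARY[wp_type]
            packages.append({
                "work_package_type": wp_type,            # level 3
                "phase": entry["phase"],
                "sub_phase": entry["sub_phase"],
                "level_4": list(entry["activities"]),    # progress activities
            })
        wbs["level_2"].append({"area": area, "level_3": packages})
    return wbs


wbs = build_wbs("Project A", {"Unit 1": ["Schedule Baseline", "Foundations"]})
```

Rejecting work package types that are missing from the library is one possible way to enforce the standardization the template aims for; a real tool might instead prompt the planner to register the new type.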
4.3. The 3rd and 4th Dimensions: “Cost Estimating” and “Cost Control”
The third and fourth dimensions, “Cost Estimating” and “Cost Control”, are also covered under the proposed structure for WBS. Planners can collect the data related to these dimensions by adding more columns, such as actual, budgeted, and variance labor or material cost, to collect project details like baseline amount and rate, and actual amount and rate, for each resource required to complete a WP.
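As a minimal sketch of the variance column mentioned above, the snippet below assumes that cost is amount multiplied by rate and that variance is baseline cost minus actual cost, so a negative value indicates an overrun. Both the sign convention and the figures are assumptions made for illustration.

```python
def cost_variance(baseline_amount: float, baseline_rate: float,
                  actual_amount: float, actual_rate: float) -> float:
    """Baseline cost minus actual cost; negative means over budget (assumed convention)."""
    return baseline_amount * baseline_rate - actual_amount * actual_rate


# Hypothetical labor resource: 100 budgeted hours vs. 120 actual hours at the same rate.
labor_variance = cost_variance(baseline_amount=100, baseline_rate=50.0,
                               actual_amount=120, actual_rate=50.0)
```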
4.4. The 5th, 6th, and 7th Dimensions: “Labor Resources Management”, “Materials Management”, and “Equipment Management”
Similarly, planners can also collect the data related to these dimensions by referring to the predefined libraries for each branch of the RBS and tracking the utilization of the resources for each WP. The RBS’s three sub-libraries are for labor, materials, and equipment.
4.5. The 8th Dimension: “Progress Measurement and Performance Evaluation” Using Weighted Progress Activities (WPA)
To collect reliable data that can be utilized during the execution of a project or analyzed after the closure of the project, this research proposes a progress measurement system. This system evaluates project progress and performance by assigning weights to the predefined progress activities according to the Rules of Credit (ROC) and then measuring the percentage of completion for each WP. Lopez [46] defined the Rules of Credit (ROC) as “referring to the guidelines by which the physical progress of a project is evaluated, assigning value to each milestone met. These milestones are reflected in a percentage of the total contract and serve as the basis for determining when and how much the contractor should be paid for completed work. Proper application of these rules is essential to keeping the project on schedule and on budget”. In other words, the project’s progress is measured by tracking the percentage of completion of the predefined weighted progress activities (WPA). The project’s performance is similarly evaluated by taking the planned values (PV) from Primavera P6 and the actual values (AV) from the timesheet system at the construction site, then calculating the earned values (EV) for each WP. The system has a “Status” column to show the project’s current status and a “Comments” column for planners to record the causes of deviations from the original budget and schedules.
Figure 6 is an example of WPA for a concrete module work package (construction work package) to measure the percentage of completion. The proposed tool requires identifying the progress activities (PA) and then assigning weights to these activities. The weights assigned to the activities, which include module assembly, module transportation, and site installation, are 60%, 10%, and 30%, respectively. After that, the planner can utilize it to measure the progress during the execution phase by specifying the actual finish date and the letter “A” that refers to “Accomplished” when the work is completed. Accordingly, the EV, variance percentage of completion, and current status are calculated for the specified work package.
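The weighted progress calculation for the concrete module example can be sketched as follows. The 60/10/30 weights and the “A” marker come from the text; the earned value formula (budget multiplied by percent complete), the budget, and the planned percentage are assumptions added for illustration, since the exact formulas used in the tool are not stated here.

```python
def percent_complete(activities):
    """Sum the weights (in %) of the activities marked "A" (accomplished)."""
    if sum(weight for _, weight, _ in activities) != 100:
        raise ValueError("activity weights must sum to 100")
    return sum(weight for _, weight, status in activities if status == "A")


def earned_value(budget, activities):
    """EV as budget times percent complete (a common convention, assumed here)."""
    return budget * percent_complete(activities) / 100


concrete_module = [
    ("Module assembly", 60, "A"),
    ("Module transportation", 10, "A"),
    ("Site installation", 30, ""),       # not yet accomplished
]

progress = percent_complete(concrete_module)        # percent of the work package done
ev = earned_value(200_000, concrete_module)         # hypothetical budget of 200,000
progress_variance = progress - 80                   # against a hypothetical planned 80%
```

With assembly and transportation accomplished, the work package stands at 70% complete, which is 10 points behind the assumed plan.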
4.6. The 9th Dimension: “Document Management”
Document management is a critical component of project management that guarantees both the effective handling and accessibility of project-related documents, contributing to project success, compliance, and risk mitigation. Construction documents usually include blueprints, drawings, contracts, permits, invoices, change orders, and correspondence. Although document management often involves the use of specialized software tools and systems, the proposed CDH tool suggests collecting a list of IFC (Issue for Construction) documents and as-built documents at this stage, while other documents might be added in the future.
4.7. The 10th Dimension: “Risk Management”
Managing risks in construction involves identifying, assessing, prioritizing, and minimizing potential risks that could affect the success of a construction project. Effective risk management enables construction firms and project teams to anticipate and resolve such challenges to minimize negative impacts on project timelines, budgets, quality, and safety. In this research, the CDH suggests collecting lists of risks for each WP type registered in the PBS library. Each list includes details such as risk name, description, magnitude, and probability. It also includes the risk impact on the scope, schedule, and budget of the project.
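A risk list keyed by work package type could look like the sketch below. The fields follow the details named above; the numeric scales, the ranking by magnitude multiplied by probability, and the example risks are assumptions, because the text does not define a scoring rule.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    description: str
    magnitude: int          # assumed scale: 1 (low) to 5 (high)
    probability: float      # assumed scale: 0.0 to 1.0
    scope_impact: str
    schedule_impact_days: float
    budget_impact: float

    @property
    def score(self) -> float:
        """Magnitude times probability: an assumed ranking rule, not the paper's."""
        return self.magnitude * self.probability


# Hypothetical register: one list of risks per work package type in the PBS library.
RISK_REGISTER = {
    "Foundations": [
        Risk("Rebar delivery delay", "Supplier lead time slips", 2, 0.5,
             "none", 5, 10_000.0),
        Risk("Unexpected groundwater", "Dewatering needed during excavation", 4, 0.3,
             "added dewatering work", 10, 50_000.0),
    ],
}


def ranked_risks(wp_type: str) -> list:
    """Return the risks of a work package type, highest score first."""
    return sorted(RISK_REGISTER.get(wp_type, []), key=lambda r: r.score, reverse=True)


top_risk = ranked_risks("Foundations")[0].name
```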
4.8. The 11th Dimension: “Environmental Impact”
In addition to air, water, soil, and noise pollution, construction can also harm the environment through waste generation, habitat destruction, and energy consumption. The proposed CDH tool, therefore, suggests collecting data related to GHG emissions and solid waste generated during the execution of a work package. This includes measuring GHG emissions and resource consumption (fuel, electricity, etc.) during the manufacturing of the building material/module, the transportation of the material/module to the site, and the on-site construction phase. It also includes collecting data related to the amount of solid waste generated during the construction phase at the work package level. These data can then be analyzed to identify the work packages that generate more GHG emissions/solid waste and to find execution or material alternatives for these packages.
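The analysis described above, identifying the work packages with the highest emissions, can be sketched as a simple aggregation over the three stages. The record layout and all emission figures below are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (work package, stage, kg of CO2 equivalent).
emissions = [
    ("Concrete module", "manufacturing", 5200.0),
    ("Concrete module", "transportation", 800.0),
    ("Concrete module", "on-site construction", 400.0),
    ("Structural steel", "manufacturing", 9100.0),
    ("Structural steel", "transportation", 600.0),
    ("Structural steel", "on-site construction", 300.0),
]


def rank_by_emissions(records):
    """Total the emissions per work package and sort from highest to lowest."""
    totals = defaultdict(float)
    for work_package, _stage, kg_co2e in records:
        totals[work_package] += kg_co2e
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_emissions(emissions)
```

The same aggregation would apply to solid waste if a waste quantity were recorded per work package.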
4.9. The 12th Dimension: “Social Impact”
The social impacts of construction projects occur during and after the construction and may vary depending on the project’s nature, scale, and location. Among the significant social impacts of construction projects are the creation of employment opportunities for residents, the improvement of infrastructure that can attract businesses and investments, and the enhancement of residents’ overall quality of life by providing better access to essential services. However, such projects can also lead to temporary traffic congestion and detours, which may inconvenience nearby residents and businesses. Although this research does not cover the specific metadata, fields, or methods for measuring social impacts, the flexibility of the proposed CDH enables the user to customize the data acquisition model and integrate this dimension into their project management analysis.
5. Developing Construction Data Hub (CDH)
The studies reviewed earlier were evaluated against the twelve dimensions that impact planning and control. The dimensions addressed, the associated phases, and the level of work detail in each study are shown in Table 2. The data acquisition models used in these studies were all custom-tailored to their respective research objectives. While these models were effectively utilized and yielded satisfactory results in knowledge discovery, they lacked generality. Implementing these models in different scenarios might be challenging, either due to their lack of adaptability or the need for extensive modifications. The results shown in
Table 2 indicate that none of the previous studies addressed the twelve dimensions while focusing on collecting project data at the work package level, which is the most efficient detailed work level. The proposed CDH model supports structuring a comprehensive data warehouse at the work package level and seeks to collect construction project data from twelve dimensions that impact the project’s planning and control.
An appropriate structure for the CDH requires the proper identification of all the entities that are involved in generating construction project data and the precise relationships between them. The authors divided the entities into two groups: organization-related entities and project-related entities. Organization-related entities refer to portfolios, programs, projects, and resources, whereas project-related entities are phases, disciplines, and work package types (also referred to as “MDWPs”). The proposed CDH adopts a relational database structure.
Figure 7 presents the entity–relationship diagram (ERD) that illustrates the various entities within the database and their interconnections. Each of the entities included in the ERD has several attributes that describe that entity. The MDWP entity is the main entity upon which the rest of the ERD is based, since the MDWP is the product of the integration between the OBS, WBS, RBS, and PBS. The integration is shown through the relation between the significant entities, such as the organization and the project.
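A heavily simplified version of such a relational structure can be sketched with Python's built-in `sqlite3` module. The tables and columns below are assumptions chosen to show how an MDWP row can be traced back to its organization; they do not reproduce the actual ERD in Figure 7.

```python
import sqlite3

# Simplified, hypothetical schema; the real ERD contains more entities and attributes.
SCHEMA = """
CREATE TABLE organization (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE portfolio (
    id INTEGER PRIMARY KEY,
    organization_id INTEGER NOT NULL REFERENCES organization(id),
    name TEXT NOT NULL);
CREATE TABLE program (
    id INTEGER PRIMARY KEY,
    portfolio_id INTEGER NOT NULL REFERENCES portfolio(id),
    name TEXT NOT NULL);
CREATE TABLE project (
    id INTEGER PRIMARY KEY,
    program_id INTEGER NOT NULL REFERENCES program(id),
    name TEXT NOT NULL);
CREATE TABLE mdwp (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES project(id),
    wp_type TEXT NOT NULL,
    phase TEXT,
    discipline TEXT,
    uom TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves foreign keys off by default
conn.executescript(SCHEMA)
conn.execute("INSERT INTO organization VALUES (1, 'Org A')")
conn.execute("INSERT INTO portfolio VALUES (1, 1, 'Portfolio 1')")
conn.execute("INSERT INTO program VALUES (1, 1, 'Program 1')")
conn.execute("INSERT INTO project VALUES (1, 1, 'Project A')")
conn.execute("INSERT INTO mdwp VALUES (1, 1, 'Foundations', 'Construction', 'Civil', 'm3')")

# Trace every work package up the organizational hierarchy.
trace = conn.execute("""
    SELECT o.name, f.name, g.name, p.name, m.wp_type
    FROM mdwp AS m
    JOIN project AS p ON p.id = m.project_id
    JOIN program AS g ON g.id = p.program_id
    JOIN portfolio AS f ON f.id = g.portfolio_id
    JOIN organization AS o ON o.id = f.organization_id
""").fetchall()
```

With foreign keys enabled, a work package cannot be stored without a valid project, which is one way the relationships between entities could be enforced.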
The authors used SQLite to build the client/server application, which demonstrates the conceptual model of the CDH (
Figure 8).
Figure 8a shows the main page of the CDH that presents the entities such as companies, portfolios/programs, projects, etc.
Figure 8b shows an example of an organization’s portfolio that includes one program, namely program 1, and
Figure 8c shows an example of a project’s WBS and its related WPs (showing attributes such as unit of measure (UOM), phases, resource details, etc.). Through the developed CDH, users can input new data manually or import it from Primavera P6, Microsoft Project, Microsoft Access, or Excel files. The CDH can be customized to meet the needs of its users since several entities and attributes can be added as required.
The CDH uses the concept of a dynamic breakdown structure, which enables the user to perform flexible grouping; for instance, work packages can be grouped according to discipline, phase, sub-phase, etc. Stakeholders can utilize the CDH to initiate a new project by entering the data points related to the project details. Data points include the portfolio or program to which the project belongs, the business unit or department that manages it, the phases of the project, the duration, and the estimated budget for each work package. The MDWP can then be populated with other data points as required, such as those related to resources, environmental impacts, etc. The developed database acts as a tool for collecting and storing detailed data about projects, which can be easily accessed in one place during and after the completion of a project. Users can also track the current status of a project during its execution phase by performing a simple query about the number of utilized resources in hours, or the amount of GHG emissions generated to date, etc.
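Flexible grouping and a simple status query of the kind described above can be sketched with one parameterized `GROUP BY`. The table name `wp_actuals`, its columns, and the figures are hypothetical; the grouping column is checked against an allow-list because SQL identifiers cannot be passed as bound parameters.

```python
import sqlite3

GROUPABLE = {"discipline", "phase", "sub_phase"}   # columns the user may group by


def totals_by(conn, column):
    """Total labor hours and GHG emissions to date, grouped by the chosen column."""
    if column not in GROUPABLE:
        raise ValueError(f"cannot group by {column!r}")
    rows = conn.execute(
        f"SELECT {column}, SUM(labor_hours), SUM(ghg_kg_co2e) "
        f"FROM wp_actuals GROUP BY {column}"
    )
    return {key: (hours, ghg) for key, hours, ghg in rows}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wp_actuals (wp_type TEXT, discipline TEXT, phase TEXT, "
    "sub_phase TEXT, labor_hours REAL, ghg_kg_co2e REAL)"
)
conn.executemany(
    "INSERT INTO wp_actuals VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Beams", "Structural", "Construction", "Site Installation", 120.0, 900.0),
        ("Columns", "Structural", "Construction", "Site Installation", 80.0, 600.0),
        ("Ductwork", "Mechanical", "Construction", "Site Installation", 50.0, 200.0),
        ("Shop drawings", "Structural", "Engineering", "Detailed", 40.0, 0.0),
    ],
)

by_discipline = totals_by(conn, "discipline")
by_phase = totals_by(conn, "phase")
```

The same stored rows answer both groupings, which is the practical sense in which the breakdown structure is dynamic rather than fixed at data entry.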
6. Case Study
An actual case study is illustrated to show the applicability of the proposed Construction Data Hub (CDH) in real-life situations and to prove that collecting construction project data from the proposed twelve dimensions is significant for tackling construction project difficulties. A machine learning model that utilizes the data collected through the CDH is developed to analyze the factors influencing construction projects’ profit. The case study data set was obtained from a private construction organization that carries out construction works in Alberta, Canada. Their construction projects include government buildings, residential buildings, universities, schools, hospitals, parks, playgrounds, infrastructure works, etc.
6.1. Factors Influencing Construction Project Profits
Accurately estimating profit margins for construction projects is a crucial decision that construction firms’ estimators make during the initial design phases. Nevertheless, this task is challenging due to various external, organizational, and project-related factors impacting profit [
47,
48,
49]. Project teams rely on intuition or uniform rates, which are not always reliable methods for determining profit margins. The construction team might also consider other factors/attributes available in their databases. Organizational attributes such as divisions, business units, portfolios, and programs might influence the profit margin. More attributes that are project-related need to be considered, too, such as the joint venture partner, project financing model (PFM), project delivery method (PDM), detailed location, late completion penalty, and the project’s architect/engineer. Bilal and Oyedele [
50] pointed out that project complexity, including risks, opportunities, and the distance the construction route covers across rivers, roads, rails, and utilities, are significant factors. They also emphasized the importance of resource allocation in predicting profit margins. The key resources include the project manager (PM), quantity surveyor (QS), commercial manager, design manager, suppliers, and subcontractors.
6.2. Data Investigation and Analysis
The original data set obtained from the organization consisted of 2018 construction projects from disciplines such as residential, commercial and institutional, infrastructure, and industrial. It included projects executed between 2004 and 2024. Only projects with 100% execution percentages were selected to be included in the research data set, while other projects with execution percentages less than 100% were discarded. Although the organization invested a lot of hours collecting data from different data sources and project management software, it was found that the collected data were still not ready for the data mining process. Soibelman and Kim [
51] highlighted the reason behind this issue: the absence of clearly defined, automated mechanisms to extract, process, analyze data, and summarize results for construction managers.
The original data set included 186 fields. The authors performed data cleaning and further investigation of the data set, which revealed issues such as duplicated data and several pieces of information combined in a single field. These issues were mainly due to the way data were collected: some information was kept in MS Excel sheets and differed from the information collected in other management software, leading to low data quality and poor analytic suitability. For instance, some fields combined significant information such as project delivery methods (PDM), project financing models (PFM), and contract types in one field. To avoid future data collection issues, the authors created a data dictionary to facilitate the data collection process and serve as a foundation for the required analytics data. Further, a data gap analysis was performed to determine the missing fields and information. As a result, the data were organized into seven categories: organization breakdown structure (OBS), project details, stakeholders, location, budget, cost and profit, and duration.
Although the authors could not collect data at the work package level as proposed earlier in their methodology, employing the proposed CDH idea at the project level proved its applicability. The authors collected the appropriate metadata from different structures and dimensions and then arranged and categorized the fields. They collected, cleaned, and organized construction projects’ metadata under the two structures: the OBS and RBS. The metadata related to the other two structures, WBS and PBS, were not available in the obtained data set. Also, metadata from six dimensions out of the proposed twelve dimensions was obtained. The available dimensions are scope management, planning and scheduling, cost estimating, cost control, labor resources, and document management (
Table 3). The authors integrated the metadata from the two structures and six dimensions in one data sheet and used it in their analysis. To avoid redundancy, the metadata fields were re-evaluated; for instance, the authors calculated the “approved change orders %” and removed the “approved change orders” field. They also used the fields “baseline start date”, “baseline finish date”, “actual start date”, and “actual finish date” to calculate two new fields, “planned duration in days” and “actual duration in days”, and then discarded the date fields. Other fields, such as “project financing model (PFM)” and “postal/zip code”, were discarded due to missing values. Thus, the resulting data set included 507 projects with 28 significant fields as profit influential factors, which will be utilized later as inputs to the machine learning model.
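The field derivation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout, the key names, and the example values are hypothetical, and the choice of the baseline budget as the denominator of the change order percentage is an assumption, since the text does not state how that percentage was computed.

```python
from datetime import date


def derive_fields(project):
    """Return a reduced record in which raw fields are replaced by derived ones."""
    planned = (project["baseline_finish"] - project["baseline_start"]).days
    actual = (project["actual_finish"] - project["actual_start"]).days
    # Assumption: the percentage is expressed relative to the baseline budget.
    co_pct = 100.0 * project["approved_change_orders"] / project["baseline_budget"]
    # The four date fields and the raw change order amount are dropped.
    return {
        "project_id": project["project_id"],
        "baseline_budget": project["baseline_budget"],
        "planned_duration_days": planned,
        "actual_duration_days": actual,
        "approved_change_orders_pct": co_pct,
    }


# Hypothetical record, for illustration only.
example = {
    "project_id": "P-001",
    "baseline_start": date(2020, 1, 1),
    "baseline_finish": date(2020, 12, 31),
    "actual_start": date(2020, 1, 15),
    "actual_finish": date(2021, 2, 14),
    "baseline_budget": 1_000_000.0,
    "approved_change_orders": 50_000.0,
}
derived = derive_fields(example)
```

Keeping only the derived fields removes the redundancy between the raw dates and the durations computed from them.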
6.3. Data Visualization Using Online Analytical Processing (OLAP) Technique
At this stage, the data set was prepared, cleaned, and ready for visualization using an OLAP technique such as pivot tables in Microsoft Excel. This technique is a powerful visualization tool because it allows large, detailed data sets to be summarized, analyzed, explored, and presented.
Figure 9 shows an example of data visualization performed during the research. It presents the number of executed projects under each portfolio and groups the portfolios according to their division. This figure clearly shows that the buildings division has executed more projects than the other divisions, while the two portfolios, buildings Alberta and buildings Prairies, have more projects than the others.
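The kind of summary shown in Figure 9 can be sketched outside a spreadsheet as well. The following Python fragment is an illustrative sketch only: the records and the "Infrastructure West" portfolio name are hypothetical, and the counts are not taken from the research data set. It counts projects per division and portfolio, then rolls the counts up to the division level.

```python
from collections import defaultdict


def pivot_count(records, row_key, col_key):
    """Count records for each (row_key, col_key) pair, like a pivot table."""
    table = defaultdict(lambda: defaultdict(int))
    for record in records:
        table[record[row_key]][record[col_key]] += 1
    return {row: dict(cols) for row, cols in table.items()}


def roll_up(pivot):
    """Aggregate the column counts into one total per row."""
    return {row: sum(cols.values()) for row, cols in pivot.items()}


# Hypothetical records, for illustration only.
records = [
    {"division": "Buildings", "portfolio": "Buildings Alberta"},
    {"division": "Buildings", "portfolio": "Buildings Alberta"},
    {"division": "Buildings", "portfolio": "Buildings Prairies"},
    {"division": "Infrastructure", "portfolio": "Infrastructure West"},
]
pivot = pivot_count(records, "division", "portfolio")  # drill-down view
totals = roll_up(pivot)                                # roll-up view
```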
6.4. Determining Influential Factors Affecting Construction Projects’ Profit Using Machine Learning
As mentioned earlier, the final data set included twenty-eight fields nominated as profit influential factors. The authors selected RapidMiner software (
https://docs.rapidminer.com/9.9/studio/installation/, accessed on 19 April 2025) to perform the machine learning process and utilized the random forest algorithm to develop a feature selection model that analyzes factors. Feature selection is a useful step that can be applied before creating a prediction model to reduce the number of inputs by identifying the most meaningful ones. The developed feature selection model uses a “weight by tree importance” operator [
52], which calculates the weight of the attributes by analyzing the split points of a random forest model. Attributes with higher weights are considered more relevant and important. This weighting scheme involves using a designated random forest to ascertain the relative importance of the attributes used. To achieve this, each node of each tree is examined to retrieve the benefit generated by the split at that node. Subsequently, the benefits for each attribute that was used for a split are summed. The average benefit across all trees is then regarded as the importance of the attribute.
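The weighting scheme described above can be sketched in a few lines of Python. This is a simplified illustration, not RapidMiner's implementation: each tree is represented as a hypothetical list of (attribute, benefit) pairs, the benefit values are invented, and the final scaling so that the top attribute receives a weight of 1.0 is an assumption made to match the range of the reported weights.

```python
from collections import defaultdict


def weight_by_tree_importance(forest, attributes):
    """Sum split benefits per attribute in each tree, then average over trees."""
    average = {attr: 0.0 for attr in attributes}
    for tree in forest:
        per_tree = defaultdict(float)
        for attr, benefit in tree:  # one entry per split node
            per_tree[attr] += benefit
        for attr, total in per_tree.items():
            average[attr] += total / len(forest)
    # Assumption: weights are scaled so the most important attribute gets 1.0.
    top = max(average.values(), default=0.0)
    if top == 0.0:
        return average
    return {attr: value / top for attr, value in average.items()}


# Hypothetical forest of two trees with invented benefit values.
forest = [
    [("baseline_budget", 0.40), ("planned_duration_days", 0.10),
     ("baseline_budget", 0.20)],
    [("baseline_budget", 0.30), ("contract_type", 0.15)],
]
attributes = ["baseline_budget", "planned_duration_days", "contract_type", "city"]
weights = weight_by_tree_importance(forest, attributes)
```

An attribute that is never selected for a split, such as "city" in this example, accumulates no benefit and therefore receives a weight of zero.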
The proposed model examines each profit influential factor as an input and identifies its correlation with the model output, which is profit. Thus, the significant factors can be used to develop a machine learning prediction model to predict profit for future projects. In this research, the developed feature selection model calculated the weights for twenty-three factors out of twenty-eight (
Table 4). The most influential attributes, with a weight greater than 0.5, were baseline budget, approved change orders %, current budget, actual cost, planned duration in days, and baseline cost. The model assigned zero weights to five attributes, namely country, city, arch/engineer, project manager, and owner, which indicates that these attributes are less relevant and important to the profit.
During split point analysis, the random forest model assigned higher weights to factors that most effectively partitioned the data set into distinct profit-based clusters. Specifically, the algorithm evaluates all possible splits across each variable (e.g., baseline budget) to maximize information gain, which is measured by criteria such as Gini impurity reduction or entropy reduction. Factors such as baseline budget (weight: 1.00) consistently created the purest subsets, indicating their predictive power. These purity improvements are aggregated across all trees in the forest, with the baseline budget emerging as the dominant splitter.
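The split evaluation described above can be illustrated with a minimal Python sketch based on Gini impurity reduction. The sketch rests on several assumptions: profit is discretized into hypothetical "low" and "high" classes (the text does not state how profit was grouped), the budget and city values are invented, and only a single numeric attribute is examined at a time.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure subset)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_benefit(values, labels, threshold):
    """Impurity reduction obtained by splitting at value <= threshold."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def best_split(values, labels):
    """Return (threshold, benefit) for the best split, or (None, 0.0) if none."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split
    scored = ((t, split_benefit(values, labels, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))


# Hypothetical data: budgets in millions and an encoded city identifier.
profit_class = ["low", "low", "high", "high"]
budget = [1.0, 2.0, 8.0, 9.0]
city = [0, 1, 0, 1]
budget_split = best_split(budget, profit_class)
city_split = best_split(city, profit_class)
```

In this toy example, the budget separates the two profit classes perfectly, whereas the city identifier yields no impurity reduction, mirroring the contrast between high-weight and zero-weight attributes.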
The baseline budget emerged as the most influential factor (weight: 1.00), as it encapsulates the project’s initial financial scale and complexity. Larger budgets typically involve more stakeholders, procurement risks, and resource dependencies, which amplify cost overrun risks. This finding aligns with Rui et al. [
49], who established that budget size is a proxy for project complexity. Approved change orders (weight: 0.86) indicate deviations from the original plan. Frequent changes often reveal insufficient feasibility studies or fluctuating client demands, diminishing profit margins due to rework and delays [
50]. The tight coupling of the current budget (0.82) and actual cost (0.56) weights underscores the impact of dynamic cost control. Projects that proactively revise budgets mitigate losses, while those with rigid plans suffer higher overruns [
30]. In contrast, variables like “city” (weight: 0.00) failed to segment the data meaningfully, leading to their exclusion. The model’s reliance on empirical split efficiency (rather than correlation alone) ensures robust, interpretable feature importance aligned with construction management realities.
Moreover, the data for many influential profit factors are available at the early stages of the project, such as the planning stage. Accordingly, these factors can be utilized as inputs for a profit prediction model. The authors highlighted that the factors available at the planning stage with a weight of more than 0.25 are relevant and important to the profit. These factors include project type, project delivery method (PDM), portfolio, contract type, program, late completion penalty, project category, baseline cost, planned duration in days, and baseline budget.
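The selection rule described above, which keeps factors that are available at the planning stage and have a weight above 0.25, can be sketched as a simple filter. Only the weights quoted in the text are used here; the availability flags are assumptions inferred from the discussion (for example, that approved change orders, current budget, and actual cost are known only during or after execution), and the key names are hypothetical.

```python
# Weights as reported in the text; the boolean flags are assumptions about
# whether each factor is known at the planning stage.
REPORTED = {
    "baseline_budget": (1.00, True),
    "approved_change_orders_pct": (0.86, False),
    "current_budget": (0.82, False),
    "actual_cost": (0.56, False),
    "city": (0.00, True),
}


def planning_stage_inputs(factors, threshold=0.25):
    """Return planning-stage factors above the threshold, by descending weight."""
    names = (
        name
        for name, (weight, early) in factors.items()
        if early and weight > threshold
    )
    return sorted(names, key=lambda name: -factors[name][0])


selected = planning_stage_inputs(REPORTED)
```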
The CDH acts as a comprehensive data hub that can be utilized to tackle various construction project issues. The outcomes of the case study emphasized the importance of collecting data from multiple dimensions and structures of a construction project. These results are consistent with the studies discussed earlier, which found that a project’s profit is impacted by various external, organizational, and project-related factors. In this research, many profit influential factors were under the OBS structure, namely company, business unit, division, portfolio, and program. These were categorized as organizational factors. Other factors under the six dimensions mentioned earlier in
Table 3 were categorized as project-related factors, such as project type, project delivery method (PDM), contract type, project category, duration, budget, etc. Also, factors such as joint venture, region, and province/state were categorized as external factors. The developed model was unable to identify any correlation between the project stakeholders, such as the project manager, arch/engineer, and owner, and the profit.
The absence of work package (WP)-level data significantly limits the model’s capacity to identify micro-scale drivers of profit fluctuations. While project-level metrics (e.g., baseline budget, total duration) offer macro-level correlations with profit, they do not capture task-specific inefficiencies that cumulatively affect financial outcomes. For instance, a project may seem profitable overall yet hide critical inefficiencies in specific WPs, such as repetitive rework in electrical installations due to design clashes or material waste in structural steel fabrication resulting from poor cutting precision. These micro-scale issues often go unnoticed in aggregated data but can substantially inflate costs when compounded across multiple WPs. WP-level tracking (e.g., labor hours per cubic meter of concrete poured or defect rates per subcontractor) would allow for the precise identification of problem areas, facilitating targeted interventions like subcontractor retraining or process redesign. Without this granularity, the model’s insights remain confined to broad trends, which reduces its usefulness for practical decision-making.
The main limitation of the developed model is the unavailability of a data set at the work package level; this might explain the model’s inability to identify any correlation between the project stakeholders and the profit margins. Machine learning algorithms tend to perform better with larger data sets. Therefore, as more data are collected and properly tracked using the CDH, it will be necessary to revalidate the model to ensure it continues to perform optimally.
7. Conclusions
Construction projects generate vast amounts of data. For the data to be useful, it must be relevant, accurate, and organized into data sets that are large enough to be processed into meaningful analysis. Construction practitioners, therefore, require a consistent integrated data acquisition model to collect detailed data from different dimensions across multiple projects. The proposed Construction Data Hub (CDH) integrates the four main structures of any construction project: organization breakdown structures (OBSs), product breakdown structures (PBSs), resource breakdown structures (RBSs), and work breakdown structures (WBSs). The CDH also collects data related to twelve dimensions that impact the project’s planning and control, and focuses on the work package as a connector between the four breakdown structures and a knowledge carrier across an organization’s different projects. To enhance the traditional work package, the authors have introduced the multi-dimensional work package (MDWP), which includes new dimensions of data. Since construction organizations have different work-level definitions, a generic model is required to collect detailed data at a specific level, regardless of a particular organization’s terminology. Defining standard levels, phases, and subphases is crucial to capturing meaningful metadata from various projects and organizations.
Academic studies on construction planning often face difficulties during the data collection process, which researchers consider a critical step that directly impacts the accuracy of their models. The CDH contributes to academia by supporting a structured and comprehensive data warehouse that serves as a central data facility for various stakeholders to retrieve the right data for making sustainable decisions. Furthermore, the data warehouse contains high-quality data from several dimensions that can be utilized to perform detailed analyses, offering valuable insights and serving as a resource for future studies, such as mathematical models for forecasting and optimization.
This research also addresses a gap in the industry by utilizing details at the work package level, which is more efficient than the current industry practice. The CDH is a practical tool since it uses the data generated during the execution of the project and is kept updated as the project progresses. The aim of collecting details from the four structures at the work package level is to provide project managers with a dynamic tool that can obtain insights about a specific element, such as a project, program, resource, equipment, or construction work package (as a product), through roll-up, drill-down, and slice-and-dice techniques using the group and sort option. Also, it can be utilized for forecasting work package requirements, such as durations, costs, and GHG emissions for future projects.
Future research will explore the integration of BIM to automate data collection at the work package level. For instance, linking the CDH to BIM could enable real-time extraction of geometric, schedule, and cost data while enhancing the CDH with non-geometric attributes such as labor hours and embodied carbon. Additionally, the model’s accuracy will be re-evaluated using work package-level data sets that capture finer-grained metrics (e.g., schedule, budgets, resources such as labor, materials, and equipment, and productivity rates per WP). Collaborations with industry partners are underway to access such data, addressing the current limitation of project-level aggregation.