#### 4.1.1. MapReduce (MR)

MapReduce was developed for handling big data. It uses a distributed processing model in which analytical tasks are written as two functions, map and reduce, as the name indicates. Mappers and reducers are the processes that run these functions on the data. Mappers first read the input and process it into intermediate results. Reducers then take the output of the mappers and produce the final results, which are ultimately stored in the file system. MR has been used by Jiao et al. [53] to develop an augmented framework for BIM. Similarly, it has also been used in construction knowledge maps [54] and other big data applications [54].
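The map, shuffle, and reduce flow described above can be illustrated with a minimal single-machine sketch. It is not Hadoop code: the site-log records, the `chunks` layout, and the function names `mapper`, `shuffle`, `reducer`, and `run_job` are hypothetical and chosen only for illustration. On a real cluster, each chunk would be handled by a separate mapper process.

```python
from collections import defaultdict

# Hypothetical input: short site-log entries, already split into two chunks
# the way a distributed file system would split a large file.
chunks = [
    ["crane inspection passed", "concrete pour delayed"],
    ["crane maintenance due", "concrete delivery delayed"],
]


def mapper(record):
    """Emit a (word, 1) pair for every word in one record."""
    return [(word, 1) for word in record.split()]


def shuffle(pairs):
    """Group mapper output by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(input_chunks):
    # Map phase: every record of every chunk is processed independently.
    mapped = [
        pair
        for chunk in input_chunks
        for record in chunk
        for pair in mapper(record)
    ]
    # Shuffle phase groups by key; reduce phase aggregates each group.
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


word_counts = run_job(chunks)
```

Here `word_counts` maps each word to its frequency across all chunks. The same pattern applies to other aggregations by changing only `mapper` and `reducer`.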

The use of MapReduce in the construction industry is inevitable given the industry's big data applications. The usability of the MapReduce framework in the construction industry relies on managing big data in a particular way. Accordingly, the datasets are analyzed and divided into categories to reduce clutter and present an easy-to-understand data output. The basic framework of MapReduce includes data input, data chunks, decomposition mappers, decomposed output, linear mappers, linear reducers, and combined output. The exact sequence and number of components in the framework can vary depending on the version used. However, the overall features and application of MapReduce remain the same, i.e., the reduction of data into manageable chunks. MapReduce not only distributes data into smaller chunks but also helps develop datasets that present a more analytic view of big data. Having organized datasets within the construction industry is of key importance, as it can greatly increase the efficiency of data management and of decision making based on data analysis.

Hadoop was the first popular big data platform to introduce MR and make it easy to write and run MR programs. For tasks requiring batch processing, MR has proved to be an effective tool, as a typical cluster contains interlinked mappers and reducers that run MR programs in parallel. However, MR also has drawbacks: it is poorly suited to some applications, such as graph processing and real-time and iterative processing. Recent versions of Hadoop have tried to address this problem by decoupling MR processing from the rest of the ecosystem. Yet Another Resource Negotiator (YARN) has also been introduced; it takes over the resource management and scheduling functions of MR and has made it easier to implement innovative applications on Hadoop.

Hadoop models have been used in construction for smart buildings and disaster management [55], failure prediction of construction firms [56], workers' safe behaviors in a metro construction project [57], and other relevant applications. The overall platform design architecture of Hadoop offers high reliability: it adopts cluster technology, multi-copy technology, independent backup technology, and other means to reduce the data failure rate effectively and build a reliable data application service platform. First, MR processes the big data in batches while simultaneously reducing and refining them. Next, the data are batched into similar items to streamline the analyses. This step further removes noise and datasets that do not align with a particular batch of data. Finally, a dataset is obtained that is refined and aligned with the original search purpose.

#### 4.1.2. Directed Acyclic Graph

Big data platforms also use the Directed Acyclic Graph (DAG), which is an alternative processing model. In comparison with MR, the DAG model relaxes the rigid map-then-reduce style of MR and is supported by Spark. Spark is widely accepted for reactive and iterative applications due to its advantages over MR in expressiveness and in-memory computation. Disk-resident and memory-resident tasks are reported to run up to ten and one hundred times faster, respectively, using Spark than MR. DAGs show relationships among variables, making them easier to understand. DAGs provide major advantages that enable experts and researchers to construct complex causal relationships in which nodes represent stochastic variables and directed edges (arrows) indicate direct probabilistic dependencies among the relevant variables. DAGs can also encode deterministic as well as probabilistic relationships among the variables. The usage of Spark and associated DAGs has been reported for construction profitability analysis [39], waste management [25], energy monitoring service on smart campuses [58], and others.
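To illustrate how a DAG relaxes the fixed map-then-reduce sequence, the following minimal sketch runs an arbitrary graph of stages in dependency order and keeps intermediate results in memory. It is not Spark code. The stage names, the cost values, and the `run_dag` helper are hypothetical, and the ordering relies on Python's standard `graphlib.TopologicalSorter` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical job: each stage name maps to (dependencies, function).
# Unlike map-then-reduce, a stage may have several inputs or feed several stages.
stages = {
    "load": ((), lambda: [120.0, 80.0, 200.0, 50.0]),  # illustrative project costs
    "filter": (("load",), lambda costs: [c for c in costs if c >= 80.0]),
    "total": (("filter",), lambda costs: sum(costs)),
    "count": (("filter",), lambda costs: len(costs)),
    "average": (("total", "count"), lambda total, count: total / count),
}


def run_dag(stage_table):
    """Execute all stages in an order that respects their dependencies."""
    # TopologicalSorter expects a mapping from each node to its predecessors.
    graph = {name: set(deps) for name, (deps, _) in stage_table.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}  # intermediate results stay in memory between stages
    for name in order:
        deps, func = stage_table[name]
        results[name] = func(*(results[dep] for dep in deps))
    return order, results


order, results = run_dag(stages)
```

Because the graph is acyclic, a valid execution order always exists; `TopologicalSorter` raises `graphlib.CycleError` if a cycle is introduced. The `filter` result is reused by both `total` and `count` without being recomputed or written to disk, which is the kind of reuse that benefits iterative workloads.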

Spark and Hadoop are among the ML tools with enormous potential in construction engineering and management. Figure 7 compares the two tools to inform research in construction. The speed of both these systems is better than that of other algorithms and ML tools currently in use in the construction industry. Moreover, both systems offer high fault tolerance and greater scalability than existing models. Data storage differs slightly between the two: Spark keeps data in memory, while Hadoop uses disk. The implementation languages also differ, since Spark is written in Scala while Hadoop has been developed using Java. Despite these slight differences, both tools can process data in batches and at a higher speed than previously existing models, making them potential tools for future model development in construction engineering and management. JavaScript has been used in construction to anticipate building material reuse [59], automated progress control coupled with laser scanning [60], shared virtual reality for design and management [61], construction information mining [62], and others. Similarly, Scala has been used for the process information modeling concept for on-site construction management [63].

**Figure 7.** Components of Spark and Hadoop. A side-by-side comparison of Spark and Hadoop provides insights about the usability and applications of each.

#### 4.1.3. Big Data Processing in Construction

Big data processing has been effectively utilized in the construction industry for failure prediction data [56], construction waste analytics [25], profitability data [39], modular and prefabricated construction [52], fire incident management [64], smart campus energy monitoring [58], healthier cities management [28], smart road management [40], and others.

Though MR and Spark each have their own significance, they are less frequently employed in the construction industry to process big data such as BIM-associated data. Bilal et al. [65] and Chang and Tsai [66] used MR to optimize the retrieval of partial BIM models. The authors found a loop in the Hadoop MR logic of data distribution. To overcome the query problem, a few steps of prepartitioning and processing were introduced for the relevant parts of the BIM data, which are later stored in Hadoop clusters. Multi-threading on each node during data analysis helped make maximum use of the CPU. This helped in customizing Hadoop for BIM data, while a YARN application implemented the querying components. YARN applications were further utilized to develop a BIM system for quantity estimation and clash detection that can execute the required tasks with many-fold improved performance.

Another research group developed a system for BIM data storage and retrieval aimed at both naive and expert BIM users [67]. The authors developed a cloud BIM system to retrieve and represent big data intelligently. This system helped develop an interactive interface to maximize the usability and utility of construction big data. Complex BIM data are retrieved by processing natural-language user queries after reformulating them. These data are then visualized by mapping them onto various visualizations. Before query evaluation, two BIM collections are merged to optimize query execution. Using this technology, a 40% reduction in response time was observed compared to traditional technologies. Currently, the utilization of BIM is limited across the construction and facilities management stages. The real intent of BIM can only be achieved once it is applied at each stage of the building lifecycle.

#### *4.2. Big Data Storage*

Big data storage is also an important aspect of BDE. In construction, big data storage has been explored for forecasting the success of construction projects [68], smart buildings data storage [69], tender price evaluation [70], and others. Despite the availability of BIM data storage, the current applications in construction still require successful implementation. Social BIM, proposed by Das et al. [71], captures building models and the social interactions among the users. The authors developed BIMCloud based on the distributed BIM framework.

Similarly, a two-tiered hybrid data infrastructure was proposed by Jeong et al. [72] for data management and monitoring of bridges. In this model, the client tier efficiently completes some analytical tasks by storing structured data momentarily using MongoDB, while the central tier stores sensor data permanently using Apache Cassandra. Lin et al. [67] also used MongoDB to store BIM data obtained through building models.

Overall, big data storage is provided by either emerging NoSQL databases or distributed file systems, as explained subsequently.

#### 4.2.1. Distributed File Systems

Distributed file systems include the Hadoop Distributed File System (HDFS) and Tachyon. HDFS is designed to deal with large and complex datasets such as those related to BIM, waste, and other construction big data sources. It operates on commodity servers grouped together in a cluster. Because it uses many servers, the probability of hardware failure also increases. To overcome this problem, HDFS achieves fault tolerance by distributing and replicating the data. However, HDFS is not a suitable option where low-latency data access is required, as it shows inferior performance. It is also troublesome to store many small files due to issues in managing metadata. Moreover, it is not useful if modifications must be made concurrently at random locations in the data. Nevertheless, HDFS has been utilized by construction researchers for observing construction workers' behavior [73], improving road performance [39], and investigating profitability performance [39]. Furthermore, based on the distributed input from HDFS, it facilitates building predictive models for conducting building simulations that give output in a predictive model markup language.
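The fault tolerance obtained by distributing and replicating data can be sketched as follows. This is a simplified simulation rather than HDFS code: the helper names, the four node names, and the round-robin placement are assumptions made for illustration (actual HDFS placement is rack-aware), while the 128 MB block size and replication factor of 3 mirror common HDFS defaults.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks that a file of `file_size` is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (round-robin assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return [block for block, holders in placement.items() if set(holders) - failed]


# Hypothetical 300 MB file on a four-node cluster (sizes in MB).
nodes = ["n1", "n2", "n3", "n4"]
blocks = split_into_blocks(300, 128)
placement = place_replicas(len(blocks), nodes)

survives_two_failures = readable_blocks(placement, {"n1", "n2"})
survives_three_failures = readable_blocks(placement, {"n1", "n2", "n3"})
```

In this sketch, every block remains readable after any two nodes fail, because each block has three replicas on distinct nodes. Data loss becomes possible only when all holders of some block fail together. The sketch also hints at the small-file problem noted above: every block adds an entry to the placement metadata regardless of its size.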

Tachyon is a distributed file system designed to extend the benefits of HDFS by providing access to the distributed data across the cluster at memory speed. It provides better performance than HDFS through in-memory data caching, and its backward compatibility allows MR and Spark tasks to run without any changes to their code. Tachyon has been utilized in construction for handling unstructured documents [65] and file storage [74].

#### 4.2.2. NoSQL Databases

Relational databases have been common for data management in past decades. However, as the technology emerged, new applications were designed for better performance, scalability, and flexibility. Relational databases lag because of their special processing and storage needs. As a result, new systems were devised to fill this technology gap. One such system is the "Not only SQL" (NoSQL) system, which has optimized data management in several ways. To achieve flexibility, it supports schemaless storage rather than schema-oriented storage. NoSQL has been widely used in different industries, including construction, due to its fragmented nature. Some examples of NoSQL in construction include integration of lessons learned knowledge in BIM [75], web service framework for construction supply chain collaboration and management [76], and Social BIMCloud implementation [71]. NoSQL systems store schemaless data in a non-relational model. They do not set too many restrictions on values and allow easy product determination. Generally, when NoSQL databases are set to key values, they carry out only specific tasks without evaluating specific values. The key-value database is mainly tailored to the business accessed through the primary key. These systems have four data models that are briefly discussed below.

• Key-value

This is the simplest data model used for unstructured data storage. However, the data lack self-description. It has been used for knowledge management in construction [77] and integration of lessons learned knowledge in BIM [75]. BIM provides positive outcomes on project success, such as cost and time reduction, communication and coordination improvement, and increased quality. Big data utilization in BIM can be beneficial to discover root causes of poor building performance, perform real-time data queries, improve the decision-making process, improve productivity, and reveal new designs and services in the construction industry, as is the case in every industry.

• Document

This model can store self-describing data. However, this model can lag in terms of efficiency. It has been used for unified lifecycle data management in architecture, engineering, construction, and facilities management through BIM integration [78].

• Columnar

Aggregated columns, grouped sub-columns, and sparse data can be stored by using this model. It has been used for integrating digital construction through the internet of things [79] and smart archiving of energy and petroleum construction projects [80].

• Graph

This model works well for property-graph-based huge datasets in relationship traversal. It has been used for the 4D construction management information model of prefabricated buildings [81] and the development of a BIM-enabled software tool for facility management [82].
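The differences among the four data models can be made concrete by representing the same building elements in each of them. The sketch below uses plain Python structures rather than any particular NoSQL product. The element identifiers (`W-101`, `W-102`, `D-7`, `R-1`), the field names, and the helper functions are all hypothetical.

```python
import json

# Key-value: the value is an opaque string, retrievable only by its primary key.
key_value = {"wall:W-101": json.dumps({"material": "concrete", "height_m": 3.0})}


def kv_get(key):
    """Look up a value by key; the store itself cannot inspect the content."""
    return key_value.get(key)


# Document: self-describing records that can be nested and queried by field.
documents = [
    {"_id": "W-101", "type": "wall", "material": "concrete",
     "openings": [{"type": "door", "width_m": 0.9}]},
    {"_id": "W-102", "type": "wall", "material": "brick"},
]


def find(field, value):
    """Return the ids of all documents whose `field` equals `value`."""
    return [doc["_id"] for doc in documents if doc.get(field) == value]


# Columnar: each row holds only the columns it needs (sparse data).
columns = {
    "W-101": {"spec:material": "concrete", "geom:height_m": 3.0},
    "W-102": {"spec:material": "brick"},
}


def read_column(name):
    """Read one column across all rows, skipping rows that lack it."""
    return {row: cols[name] for row, cols in columns.items() if name in cols}


# Graph: directed, labelled edges between elements support relationship traversal.
edges = [
    ("D-7", "HOSTED_BY", "W-101"),
    ("W-101", "BOUNDS", "R-1"),
    ("W-102", "BOUNDS", "R-1"),
]


def traverse(start, label):
    """Follow all edges with the given label that leave `start`."""
    return [dst for src, lab, dst in edges if src == start and lab == label]
```

For example, `find("material", "brick")` queries inside the documents, whereas the key-value store can only return the whole stored string for `"wall:W-101"`, which the caller must then decode.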

Databases concerning big data storage and management are widely used worldwide for research on various topics. The construction industry has also relied on big data sources and databases over the last five to ten years. As shown in Figure 8, search engines are the most widely searched database category in the last five years, followed by relational and graph DBMS. As of November 2021, when the data for this review were analyzed, other heavily used databases for extracting and using big data for the construction industry include document stores, native XML, key-value stores, and wide column stores. Searches for object-oriented DBMS and multivalue DBMS are considerably lower than those for relational DBMS and graph DBMS, whereas search engines outperform all other DBMS. These different databases provide data sources for BIM and computational sources for developing structures that could guide larger construction projects. The rising trend in using big data sources shows the increasing interest of the construction industry in big data. For example, exchanging and reusing information is critical for engineering and construction project management. The issues pertaining to data exchange have been minimized with the application of the Extensible Markup Language (XML). An XML-based Distributed Construction Estimating System (XDCES) has been helpful in reducing the overload of cost-estimating information exchange. Similarly, construction-based DBMS enable construction companies to build and maintain a database easily. They allow supervisors and workers to capture information using a mobile or tablet device; all of that information is then stored in the cloud and is accessible via a desktop version.

**Figure 8.** Database popularity in 2016–2021 based on search trends.

#### *4.3. Big Data Analytics (BDA)*

BDA draws on a variety of disciplines that share one goal: finding patterns in data. Some of these related disciplines are data mining, statistics, business analytics, predictive analytics, data analytics, knowledge discovery from data, and the most recent one, big data. Big data builds on the earlier techniques to broaden the field of data analytics. Some ML-based tools have been developed for BDA. In construction projects, BDA has been used for improving building design and effective performance monitoring [37], project safety, energy, resource, overall management and decision-making frameworks [38], and quality and waste management [24]. Big data analytics has been taken a step further by developing predictive analysis techniques. Ngo et al. [83] used a factor-based big data predictive analytic tool for analyzing the capacity of construction industries to deal with big data. This tool was tested and validated on four different construction organizations to ensure that the predictive analytic method could improve how the construction industry can use big data. The integration of big data in the construction industry remains an avenue that requires further research in terms of big data analytics. The gaps in this area were explored by Atuahene et al. [30] and Atuahene et al. [84]. It was identified that the management and processing of data by firms led to the generation of more data, which made data analysis an uphill task. Developing an integrated framework for managing big data and sorting the useful datasets can greatly increase the usability and application of big data in the construction industry. Overall, data analytics is conducted through statistical, data mining, and regression techniques, as explained below.
