1. Introduction
Robot-based construction assembly refers to the use of robotic systems for joining together various building components, materials, and systems to form a complete structure or a part of a structure [
1]. It has emerged as a promising solution to address various challenges including increasing costs, labor shortages, project schedules, and the increasing demand for safe and efficient construction processes [
2]. The use of robotic systems and the corresponding changes to the existing construction workflow are expected to significantly enhance productivity, reduce construction costs, and improve safety of construction projects [
3]. Moreover, robot-based assembly systems can perform construction tasks that are repetitive, hazardous, or require high precision, thereby alleviating the burden on human workers [
4].
Despite the potential benefits of robot-based assembly in construction, one of the main challenges faced by these systems is the need for effective and efficient sequence planning. A construction task often consists of a variety of interdependent steps that must be executed in a specific sequence order [
5]. For example, installing a plumbing system requires a proper sequence of connecting pipes of different diameters and lengths, and necessitates the use of the appropriate couplings. Similarly, bricklaying requires placing the right bricks in the corresponding locations in the correct sequence. Many of these sequence planning tasks rely on spontaneous decisions, as construction tasks are often less predictable and difficult to plan out due to varying site conditions, resource availability, and evolving requirements [
6]. As a result, construction workers often need to perform manual sequence planning on the fly, which involves determining the optimal order of construction steps and taking into account the corresponding logistic considerations. Manual sequence planning is a time-consuming and labor-intensive process, requiring a significant amount of experience to ensure quality and accuracy. Moreover, the complexity of construction task sequences can vary significantly depending on the specific construction project, further increasing the difficulty of the task [
7]. Without an effective method for automated sequence planning, robot-based construction automation would not be scalable for meeting the needs of real-world complex construction tasks.
In order to enable automation systems (including construction robotics) to handle more complex multi-step operational tasks, efforts have been made to explore heuristic-based methods or learning-based methods. Early investigations included the use of mathematical and heuristic techniques in tackling the complex problem of sequence planning, such as mixed-integer linear programming (MILP) (e.g., [
8]). Recently, advances in machine learning have been leveraged to support complex sequence planning with various constraints (e.g., [
9]). These techniques aim to optimize operational sequences by considering factors such as precedence constraints, resource availability, and task interdependencies. By integrating these approaches with robotic systems, researchers expect to develop more efficient and adaptable solutions that can manage the inherent complexities and uncertainties of construction operations.
However, these methods have certain limitations that hinder their effectiveness in addressing the dynamic nature of construction projects. On the one hand, mathematical and heuristic techniques often involve the development of tailored algorithms (by human experts) that leverage domain-specific knowledge and rules [
10]. While these methods can effectively navigate the complex solution space for complex and variable construction tasks, they may impose a significant computational overhead due to the need for continuous adaptation and refinement of the heuristics as the construction process evolves. On the other hand, although machine learning methods, such as genetic algorithms and neural networks, can adapt to dynamic scenarios much easier compared to mathematical and heuristic techniques, they require a significant amount of training data to achieve accurate results [
11]. In construction operations, where site conditions and project requirements can change frequently, acquiring sufficient training data for every possible scenario is challenging, limiting the adaptability of these methods to dynamic environments.
The primary objective of this research is to design and evaluate a new system, named RoboGPT, that leverages the capabilities of ChatGPT to achieve automated sequence planning in robotic assembly for construction tasks. ChatGPT, as an advanced large language model (LLM), has demonstrated remarkable capabilities in understanding and generating human-like text, which relies on a reasoning ability for understanding the inherent structures of a sequence [
12]. By integrating ChatGPT into the construction process, we aim to minimize the reliance on manual intervention, reduce planning time, and increase the overall efficiency of robot-based assembly systems in the construction industry. Specifically, in this paper we will show how we adapted ChatGPT for the purpose of automated sequence planning in robot-based assembly for construction applications and demonstrate the feasibility and effectiveness of the proposed approach through an experimental evaluation, including comparing the ability of ChatGPT-driven robots in handling complex construction operations and adapting to changes on the fly. By accomplishing these goals, this paper will contribute to the ongoing efforts to enhance the capabilities and performance of robot-based assembly systems in the construction industry and pave the way for further integration of LLM technologies in the field of construction robotics.
The main hypothesis of this paper is that the reasoning capabilities inherent in ChatGPT can be harnessed to create a flexible and efficient sequence planning system, RoboGPT, for construction tasks. To validate the flexibility of the proposed system, we assess its performance across two distinct tasks in varying fields and offer a qualitative analysis of the outcomes. For determining the effectiveness of RoboGPT, we undertake an extensive evaluation with a pipeline installation task under four distinct scenarios. Given that the response time of ChatGPT is consistently swift, we employ the success rate across repeated scenarios as a reliable measure of system efficiency. We should note that the pipeline installation could also be considered as a case study for the flexibility test.
The remainder of this paper is structured as follows:
Section 2 reviews the existing literature on construction robotics for assembly tasks and sequential planning for multi-step operations, and explores the potential applicability of large language models (LLMs) in addressing planning tasks. In
Section 3, we introduce our novel system, RoboGPT, detailing its implementation for automating sequence planning in the realm of construction assembly.
Section 4 presents the experimental results and evaluation of RoboGPT, which is enriched with insights from two case studies. In
Section 5, we further evaluate RoboGPT’s performance through a comparative study that involves designing pipeline connections in two distinct scenarios using two different sets of pipes. Finally,
Section 6 provides a discussion on the broader implications of our findings, as well as suggestions for potential future research trajectories.
3. Methodology
3.1. Architecture
Figure 1 presents the comprehensive system architecture of RoboGPT, which is composed of four primary components: the Robot Control System, Scene Semantic System, Object Matching System, and User Command Decoder System. ChatGPT, an advanced natural language processing model, functions as the central intelligence within the system. Upon receiving task descriptions and specific requirements from users, ChatGPT meticulously generates sequential solution commands in a step-by-step manner, adhering to the precise requirements of the task. The generated response text is subsequently decoded by the User Command Decoder System and transmitted to a Unity-based virtual environment in the form of virtual objects. The Scene Semantic System is responsible for detecting real-world objects, which are then sent to the Unity environment to be meticulously aligned and matched with their virtual counterparts. Once the alignment is complete, the objects, in conjunction with the corresponding actions derived from the commands, are relayed to the Robot Control System to facilitate real-world object manipulation.
3.2. Robot Control System
The robot system tested in this study was a Franka Emika Panda robot arm, which is a lightweight, compact, and versatile robot designed for human–robot collaboration and which is widely used in manufacturing, research, and education as it is known for its ease of use, flexibility, and reliability. The Panda has seven degrees of freedom corresponding to its seven joints. Each joint is equipped with a force/torque sensor and a joint-angle sensor to accurately measure the states of the robot arm, allowing it to move in various directions and perform intricate tasks with high precision. A parallel gripper is attached as the end-effector on the seventh joint, which can be used to interact with objects through picking up and dropping.
To smoothly control the end-effector and generate a stable moving trajectory, the impedance controller is applied in cartesian coordinates, as shown in
Figure 2. The impedance of the end-effector can be adjusted based on the force or torque applied by the environment, allowing the robot to adapt to varying conditions. Specifically, the controller imposes spring–mass–damper behavior on the mechanism by maintaining a dynamic relationship between the force, position, velocity, and acceleration:
where F, v
ee, ∆x
eerob, l, and d ϵ R
3 are the implemented force on the end-effector, the velocity of the end-effector, and the position of the end-effector in the robot coordinate system and payload, respectively. Given the end-effector’s current position x
ee_currob and the desired position x
ee_desirerob, ∆x
eerob is calculated as:
x
ee_desirerob is the real-world target location derived from the Real-Virtual Object Matching System that is discussed in the following section. In order to control the virtual robot arm in Unity to interact with the virtual objects, the real-time joint position
qrob ϵ R
3 is sent to Unity through the ROS-Unity bridge (RUB) to synchronize the virtual arm. Each element in
qrob is the rotation angle for the corresponding joint. The gripper’s status of the real robot arm is also sent through the RUB to instruct the virtual robot’s behavior, and the interaction between the virtual gripper and objects will be sent back to ROS to control the real gripper’s action.
3.3. Semantic Segmentation System
The Scene Semantic System collects the visual information from the surrounding environment and detects the real target objects for downstream alignment. A Velodyne-16 LiDAR (VL16) is used to capture point cloud data and save it on the ROS platform. The LiDAR sensor coordinate system is calibrated with the Panda coordinate system to ensure the positions of the detected objects.
The VL16 was selected as the scanning sensor because of its high scanning speed and stable scanning results. Since the VL16 only has 16 scanning rings in the vertical direction, which is too sparse to capture the detail spatial information, an augmentation scanning strategy was applied to register the scanning results from multiple viewpoints and generate a dense scanning result. To eliminate the influence of the error caused by the registration of multiple frames, we applied the density-voting clustering method to shift the drifting points to the closest density center so that all the returning points will be close to the object surfaces and the shape of the objects can be perfectly captured.
The virtual scene data, including joint states, point clouds, and virtual objects with physical properties, are then sent to the Unity game engine for interface reconstruction. In order to subscribe to data from ROS via the network, the ROS-Unity bridge and ROS# are used to build a WebSocket, which allows two-way communication and data transfer between ROS and Unity. We also used ROS# to build some nodes in Unity to publish and subscribe to topics from ROS. Baxter’s state data (URDF, joint, and gripper state) are used to build a virtual Baxter that replicates the same states of the real Baxter. The same prefab library as mentioned in the scene recognition system is used to provide virtual object information with physical properties that can be used to rebuild stationary objects in the game engine. We also use the Unity physical engine to assign the point cloud and virtual object with physical properties and rebuild a virtual working scene based on the data from ROS.
The augmented and clustered point cloud PC
cam ϵ R
N*3 is then fed into PointNet++, which we took as our segmentation model as shown in
Figure 2. N denotes the number of points according to the input size of the model. PointNet++ is a deep learning model that has been well-trained on various point cloud datasets and can handle both object detection and semantic segmentation tasks. In this application, we only focused on the segmentation branch of PointNet++ to obtain the object labels of each point. The segmented points are clustered as point sets [PC
cam0, …, PC
camn] and the corresponding predicted labels [c
0, …, c
n]. The point sets are then used to estimate the oriented bounding boxes that closely wrap all of the points as [Box
cam0, …, Box
camn]. The bounding boxes are parameterized as Box
cami := [s
iT, p
iT]
T, where s
i and p
i ϵ R
3 are the size (width, length, and height) and location (x
i, y
i, and z
i). The labels, sizes, and locations of segmented point sets are then sent to Unity through the RUB as the classification results, size estimation results, and pose estimation results.
3.4. Command Decoder
The command decoder works as the translator to transfer the response from ChatGPT in natural language into a machine-understandable programming command so that the robot arm can execute the actual sequential actions inferred by ChatGPT. We used the ChatGPT-4 model and coded with python and C# to build the API to communicate between Unity and the online model. The API is based on the HTTP request. The user sends a text prompt to the API, and it will return a response in the form of a text message. The API also supports various customization options to regularize the response by typing the specific requirements in the “system” section.
For most construction assembly tasks, the sequential actions can be simplified as moving an object to a certain location. For example, moving the pipe to position A or putting the brick at position B. Therefore, the operation command for the sequential actions can by represented by an action, object, and target position. In order to make the reply from ChatGPT more explainable, we designed the “system” with three principles:
ChatGPT will generate the reply step by step in an execution order.
For each step, there is only one motion and one object to be moved or operated. There is only one target location.
The related words about action, object, and target position must be surrounded by brackets.
Given the regularization principles, the reply from ChatGPT could be simplified as:
…
Therefore, the regularized reply from ChatGPT can be firstly split into single steps. Then, the single steps can be used to extract the action, object, and target position, as shown in
Figure 3. The brackets are used to crop the action or object names as strings. The detected string will then be checked to see if it shows up in the pre-defined action or object dictionary. If the dictionaries contain the string, the corresponding action or object will be sent to the Real-Virtual Object Matching System. The dictionary contains the name of common actions and objects in construction sites.
3.5. Object Matching System
The detected action, object, and position will then be sent to the Object Matching System to be paired with the detected objects from the real world and to be transferred to the robot arm control codes. Specifically, the detected object, noted as obj
prompt, will be firstly matched with the label of segmented objects from the Semantic Segmentation System, noted as obj
seg. Note that the labels of the segmentation system are strings that are included in the object dictionary. Given a matched pair (obj
iprompt, obj
jprompt), obj
iprompt is then assigned the parameters of obj
jseg, including l
jseg for size and p
jseg for position. Then, obj
iprompt has four major properties:
where action
jprompt is the matched action from the dictionary and position
iprompt is the target location. Thus, the desired operation on obj
iprompt is parameterized as its current position, the action, and its target position, which can be understood by the robot arm. Then, obj
iprompt can be sent to the Panda for a single-step operation in a sequence.
5. Comparison Study
In order to demonstrate the advantages of the proposed RoboGPT system in intricate multi-stage robotic operations and investigate the capacity of ChatGPT to address real-world construction challenges, we conducted a comprehensive evaluation of the RoboGPT system in the context of a pipeline installation under various conditions. This comparative study aimed to assess the system’s performance, as well as to elucidate its potential and limitations.
We opted not to incorporate the material stacking and Hanoi tower puzzle scenarios in this investigation for two primary reasons. Firstly, the material stacking task is relatively elementary, as it predominantly necessitates rudimentary knowledge of object stacking based on size. Secondly, the central challenge of the Hanoi tower puzzle resides in completing the task within a constrained timeframe, which does not align with the objectives of our study.
Conversely, the pipeline installation scenario presented a more open-ended challenge, requiring the system to determine the spatial dimensions, evaluate resource availability, and devise an appropriate method for connecting the pipes. It is crucial to note that this task does not entail a singular solution; rather, multiple viable solutions can achieve the desired outcome. Consequently, the pipeline installation task, which demands a thorough assessment of dimensions, resource estimation, and sequencing while considering both spatial and resource constraints, is better suited for our comparative analysis. We applied two different tasks with two different conditions to evaluate the performance of the purposed system. Since pipe installation tasks in the real world often require a large spatial space, which is hard to manipulate with a research-based robot arm, we built the simulation environment in Unity to test the results. Given the knowledge from a pipe installation process at a real construction site, we designed the Avoid Obstacles and Pass Points tasks for further testing.
The Avoid Obstacles task is to design a pipeline that connects two points, but the pipes cannot pass certain points. This task was designed to simulate a case where the pipes have t avoid some pre-built structures or safety areas. The testing environment was designed as a 10 × 10 × 10 room with a start point location of P
start1 = (5, 5, 0) on the floor and the end point of P
end1 = (5, 5, 10) o the roof. The two obstacle points, named A
obs and B
obs, were located at (5, 5, 5) and (5, 7, 5), respectively.
Figure 11 shows the setup environment of Avoid Obstacles. The green cube denotes the start point and the red cube denotes the end point. The two small black cubes denote the obstacles to be avoided.
The Pass Points task is to find a solution and design a pipeline between two given positions, passing certain points. This situation was designed to simulate a case where the pipes must connect some devices, such as air conditioners, or the pipes must pass through some holders on the wall as supports. Similarly, the testing environment was also in a 10 × 10 × 10 room with the start point location of P
start2 = (0, 0, 0) and the end point of P
end2 = (10, 10, 10) The two mandatory points, named A
man and B
man, were located at (0, 0, 8) and (6, 6, 0), respectively.
Figure 12 shows the setup environment of Pass Points. Similarly, the green cube denotes the start point and the red cube denotes the end point. The two small black cubes denote the mandatory points to be connected.
5.1. Avoid Obstacles Task
In the Avoid Obstacles task, we set two different conditions: the constant condition and variable condition. To be specific, the constant condition refers to the situation where the pipes to be used to build the pipeline are the same size. In our case, we set the length of the pipes to be 2. Note that the diameter of the pipes was ignored. On the contrary, the variable condition referred to the case where the pipes’ sizes were not fixed. To be specific, we used three types of pipes with the lengths of 2, 3, and 4. The system can choose any of the pipes to build the pipeline.
The prompt we used for the constant condition is as follows: “Can you help me with pipe connection? We have several 2ft length straight pipes (pipe 2ft), 3ft length straight pipes (pipe 3ft), 4ft length straight pipes (pipe 4ft). The start position is (5ft, 5ft, 0ft) direction is the positive Z axis, the end position (5, 5, 10) direction is the negative Z axis. We assume that each straight pipe can be connect to each other directly. You can just tell me the position of each pipe, such as ‘pipe 2ft #1 (5, 5, 2) z axis, pipe 2ft #2 (5, 5, 4) z axis, pipe 2ft #3 (5, 7, 4) y axis’. To be noted, each pipe must maintain parallelism to the X, Y, and Z axes. There are two obstacles at point (5, 5, 5) and point (5, 7, 5), the pipe cannot pass through this point from neither X, Y nor Z axes.”
The prompt for the variable condition is: “Can you help me with pipe connection? We have several 2ft length straight pipes (pipe 2ft), 3ft length straight pipes (pipe 3ft), 4ft length straight pipes (pipe 4ft). The start position is (0ft, 0ft, 0ft) direction is the positive Z axis, the pipe connection must pass the first mandatory point (0, 0, 8), then pass the second mandatory point (6, 6, 0), finally to the end position (10, 10, 10) direction is the negative Z axis. We assume that each straight pipe can be connect to each other directly. You can just tell me the position of each pipe, such as ‘pipe 2ft #1 (0, 0, 2) z axis, pipe 4ft #1 (0, 0, 6) z axis, pipe 3ft #1 (0, 3, 6) y axis’. To be noted, each pipe must maintain parallelism to the X, Y, and Z axes. The pipe must pass each mandatory point (0, 0, 8) and (6, 6, 0).”
For each condition, we used the same prompt to generate 20 trials.
Table 1 lists the counting results of successful and failed trials. The sub-optimal trials refer to the cases where the RoboGPT system could give the correct connection design, but with unnecessary pipes and detours.
The results show a significant difference between the successful rates of the two conditions, being 100% for the constant condition and 25% for the variable condition. Theoretically, the two conditions corresponded to two difficulty levels in solving the problem. For the first condition, the pipe’s length is almost the unit length compared with the room’s scale. There is no need to consider the arrangement of pipes to achieve a certain length of the total pipeline. In other words, the final solution could use any number of pipes and the only requirement was to avoid A
obs and B
obs, and to finally reach P
end 1. However, for the second condition, the lengths of the pipes vary from 2 to 4. So, the solution had to not only satisfy the requirement of passing the mandatory points and reaching the target, but also had to find the proper combination of pipes with different sizes to achieve the full length of the pipeline. This was an extra constraint which restricted the solution space, added logical difficulty, and made the problem harder to solve. In other word, the resource pipes that could be used to build the pipeline were restricted.
Figure 13 shows the assembling process of a successful trial.
Figure 14 shows a typical sub-optimal solution demonstrating that, given sufficient pipes without any constraints, ChatGPT can give a redundant design with unnecessary costs. The proposed pipeline by ChatGPT makes an unnecessary detour to avoid the obstacles.
Figure 15 gives a successful example as an standard solution of variable condition.
Figure 16 illustrates the shortcomings of ChatGPT in spatial understanding. The layout on the left shows the failure in condition 2 as the pipe only reaches the height of the end point, but it does not find the location on the x-z plane. The failed layout on the right shows that the start and end point of the pipe was not understood, so the following pipe was connected from the middle of the previous pipe, as shown in the red circle. The results proved that ChatGPT cannot always precisely understand the spatial information from a pure text input.
5.2. Pass Points Task
In the Pass Points task, we used the same two conditions as in the previous task. The prompt we used for the constant condition is as follows: “Can you help me with pipe connection? We have several 2ft length straight pipes (pipe 2ft). The start position is (0ft, 0ft, 0ft) direction is the positive Z axis, the pipe connection must pass the first mandatory point (0, 0, 8), then pass the second mandatory point (6, 6, 0), finally to the end position (10, 10, 10) direction is the negative Z axis. We assume that each straight pipe can be connect to each other directly. You can just tell me the position of each pipe, such as ‘pipe 2ft #1 (0, 0, 2) z axis, pipe 2ft #2 (0, 0, 4) z axis, pipe 2ft #3 (0, 2, 4) y axis’. To be noted, each pipe must maintain parallelism to the X, Y, and Z axes. The pipe must pass each mandatory point (0, 0, 8) and (6, 6, 0).”
The prompt for the variable condition is: “Can you help me with pipe connection? We have several 2ft length straight pipes (pipe 2ft), 3ft length straight pipes (pipe 3ft), 4ft length straight pipes (pipe 4ft). The start position is (0ft, 0ft, 0ft) direction is the positive Z axis, the pipe connection must pass the first mandatory point (0, 0, 8), then pass the second mandatory point (6, 6, 0), finally to the end position (10, 10, 10) direction is the negative Z axis. We assume that each straight pipe can be connect to each other directly. You can just tell me the position of each pipe, such as ‘pipe 2ft #1 (0, 0, 2) z axis, pipe 4ft #1 (0, 0, 6) z axis, pipe 3ft #1 (0, 3, 6) y axis’. To be noted, each pipe must maintain parallelism to the X, Y, and Z axes. The pipe must pass each mandatory point (0, 0, 8) and (6, 6, 0).”
Similarly, we used the same prompt to generate 20 trials with new chat channels.
Table 2 lists the counting results of successful and failed trials. To intuitively show the results from the two conditions, we picked a success trial and a failed trail from each condition and provide the visualizations in
Figure 17,
Figure 18,
Figure 19 and
Figure 20.
Figure 17 and
Figure 18 show the successful and failed trials under constant conditions. The layout in
Figure 18 further proves the shortcomings of ChatGPT in spatial understanding. There were two gaps along the pipeline, indicating that ChatGPT might wrongly overlap the two points only based on their 2D coordinates. The two end points in the red circle had the same x and z coordinates, but different y coordinates. The ones in the yellow circle had the same x and y coordinates, but different y coordinates. In other words, if the coordinates of the two points were the same along one or two axes, they would be wrongly aligned and treated as the same points. In this case, it is reasonable to deduce that ChatGPT relied more on pure, separated numerical analysis to solve the real-world problem. The x, y, and z coordinates of the two points might be separately compared and the two points would be considered as the same if the sum of the total difference is under a threshold. Even if the two end points were on the same x-z and z-y planes, they would still be treated as the same points in 3D space. Thus, the visual or multi-dimensional inputs are required for ChatGPT to build an accurate 3D scene for real-world operation.
Figure 20 shows the influence of the constraint caused by using different sizes of pipe. The pipeline could only approach the mandatory points, but not pass them.
In conclusion, the system demonstrated superior performance under constant conditions in the second task as opposed to the first one. This can be attributed to the fact that avoiding specific points offered a greater array of potential solutions compared to passing points, resulting in a higher level of stability for the ChatGPT system. Consequently, the success rate for the second task was 1, whereas it was only 0.7 for the first task.
Considering the two tasks and the two conditions derived from real-world environments, it is evident that, in contrast to study cases 1 and 2, employing ChatGPT and RoboGPT systems to address real-world construction tasks introduces additional constraints that significantly impact the stability and overall performance of the system. Furthermore, it is crucial to recognize that addressing real-world tasks encompasses not only achieving the desired objectives, but also optimizing resource utilization. Consequently, future research should aim to guide the ChatGPT agent towards identifying the most efficient and effective means of resolving the problem at hand.
6. Conclusions
In this paper, we presented a robotic system leveraging ChatGPT-4 for automated sequence planning in complex construction assembly tasks, such as assembling structural components of a building, installing electrical and plumbing systems, and coordinating the movement of construction equipment on site. The tasks involved a wide range of spatial constraints, including a limited workspace, safe operation distances, and the proper placement of components, as well as resource constraints, such as the availability of equipment and personnel. We developed a framework that allowed ChatGPT-4 to ingest relevant input data, including construction specifications, blueprints, and a list of available resources. The model was then able to generate an optimized assembly sequence plan by decomposing the tasks into logical steps, ensuring that the spatial and resource constraints were satisfied. Each step included specific instructions for the robotic system, such as the order of operations, the type and quantity of resources required, and the optimal path for the movement of equipment and materials. To evaluate the effectiveness of the ChatGPT-4-based method, we compared its performance with that of two real-world construction tasks. Our results showed that the ChatGPT-4-based system has the potential to understand the background logic of a sequential task and give a corresponding solution. We also used the test results from 80 trials to intuitively demonstrate the current limitations and boundaries of the ChatGPT agent in solving real-world tasks considering the physical constraints and resource restrictions. To be able to assist human workers in solving real construction problems, the abilities of ChatGPT for spatial understanding and dynamic management need to be improved.
Honestly, there are several limitations to our approach. First, we have yet to fully understand the underlying mechanisms that allow ChatGPT-4 to be used for construction task sequence planning, particularly when considering spatial and resource constraints. Second, the level of trust human workers have in the ChatGPT-4-based system remains unknown, which could impact the adoption of this technology in real-world scenarios. Lastly, ChatGPT-4’s ability to process and analyze imagery data is limited, restricting its applicability in situations where visual information is crucial. Future research should focus on addressing these limitations and expanding the scope of the study. It is essential to test more construction applications to validate the robustness of the ChatGPT-4-based method and assess its performance across diverse tasks. Furthermore, investigating the reasons behind ChatGPT-4’s success in construction task sequence planning will enhance our understanding of its capabilities and help improve the model. Additionally, integrating ChatGPT-4 with computer vision techniques could pave the way for a fully automated process, which would enable seamless collaboration between the language model and visual data processing systems, ultimately boosting efficiency and accuracy in construction sequence planning. Last but not least, in the pipeline installation task, we only used the successful rate as the performance indicator. However, multiple criteria should be considered in real construction tasks such as system robustness, different tasks’ efficiency, and user satisfaction. The corresponding data should be collected and the multi-criteria methods should be applied to evaluate the overall performance of the system, including the stable preference ordering towards ideal solution (SPOTIS) [
60] or RANKing COMparison (RANCOM) [
61].
In our future work, we plan to augment our RoboGPT system with Reinforcement Learning from Human Feedback (RLHF) [
62] to enhance its adaptability and robustness across a wide range of construction scenarios. To achieve this, we will design and integrate a feedback mechanism that enables the collection of human expert preferences and evaluations to guide the model’s learning process. By incorporating RLHF, the RoboGPT system can iteratively update its sequence planning capabilities based on expert feedback, allowing it to better comprehend the intricacies and subtleties of construction tasks. This approach enables the system to adapt more effectively to the dynamic nature of construction projects, while also reducing its reliance on large amounts of training data. Furthermore, we will develop a method for incorporating feedback from virtual simulations, which will reflect the consequences of the generated construction sequences. This additional source of feedback will enable the RoboGPT system to refine its calculations in real time and improve its overall performance. Also, as the current system serves as a research prototype, our future work will be directed towards enhancing its user interface. This is intended to increase its accessibility and utility for users with various levels of expertise and make the system more inclusive and user friendly for beginners.
Additionally, our experiments qualitatively illustrate that the use of RoboGPT can contribute to lower operational costs and increased efficiency across a range of construction tasks. Compared with the traditional robotic and automation methods, our proposed system could further reduce the need for human decision making and intervention and contribute to lower operational costs and increased efficiency. To make the system more applicable for general implementation, it is still worth quantitatively showing the actual increase in efficiency by using our system in specific tasks.