A Human–Robot Team Knowledge-Enhanced Large Language Model for Fault Analysis in Lunar Surface Exploration

Wang, Hao; Xue, Shuqi; Zhang, Hongbo; Wang, Chunhui; Fu, Yan

doi:10.3390/aerospace12040325

Open AccessArticle

A Human–Robot Team Knowledge-Enhanced Large Language Model for Fault Analysis in Lunar Surface Exploration

by

Hao Wang

¹

,

Shuqi Xue

²,

Hongbo Zhang

¹,

Chunhui Wang

^2,* and

Yan Fu

^1,*

¹

School of Mechanical Engineering and Science, Huazhong University of Science and Technology, Wuhan 430072, China

²

National Key Laboratory of Human Factors Engineering, Astronaut Research and Training Center of China, Beijing 100094, China

^*

Authors to whom correspondence should be addressed.

Aerospace 2025, 12(4), 325; https://doi.org/10.3390/aerospace12040325

Submission received: 11 February 2025 / Revised: 31 March 2025 / Accepted: 7 April 2025 / Published: 10 April 2025

(This article belongs to the Special Issue Aerospace Human–Machine and Environmental Control Engineering)

Download

Browse Figures

Versions Notes

Abstract

:

Human–robot collaboration for lunar surface exploration requires high safety standards and tedious operational procedures. This process generates extensive task-related data, including various types of faults and influencing factors. However, these data are characteristic of multi-dimensional, time series, and intertwined. Also, prolonged tasks and multi-factor data coupling pose significant challenges for astronauts in achieving safe and efficient fault localization and resolution. In this paper, we propose a method to enhance the base large language models (LLMs) by embedding knowledge graphs (KGs) of lunar surface exploration, thereby assisting astronauts in reasoning about faults during the exploration process. A multi-round dialog dataset is constructed through the knowledge subgraph embedded in the request analysis process. The LLM is fine-tuned using the p-tuning method to develop a specialized LLM suitable for lunar surface exploration. With reference to the situational awareness (SA) theory, multi-level prompts are designed to facilitate multi-round dialogues and aid decision-making. A case study shows that our proposed model exhibits greater expertise and reliability in responding to lunar surface exploration tasks than classical commercial models, such as ChatGPT and GPT-4. The results indicate that our method provides a reliable and efficient aid for astronauts in fault analysis during lunar surface exploration.

Keywords:

lunar exploration; human–robot collaboration; human–robot dialogue; large language model; knowledge graph; situational awareness

1. Introduction

Lunar surface exploration represents one of the strategic missions of this century, facing numerous challenges such as low light conditions, Earth–Moon communication issues, and terrain mapping [1,2,3]. Nevertheless, robotic missions for lunar surface exploration have been conducted with increasing frequency and success, yielding various results [4]. As the scope and complexity of lunar surface exploration missions continue to expand, the quality requirements for system-related input and output data will increase to enhance mission success rate [5]. However, current robotic autonomous systems are limited to executing pre-programmed functions and cannot perform cognitive decision-making or emergency-related tasks with a high degree of quality [6]. Therefore, joint human–robot missions, where astronauts perform cognitive decision-making and emergency-related tasks to obtain higher-quality data, are crucial for advancing lunar surface exploration research [7].

During decision-making and emergency response, robots and sensors generate vast amounts of data that astronauts must analyze to make informed decisions [8,9]. For example, as shown in Figure 1, lunar surface sampling is a multifaceted process comprising several steps, and each step may have a series of faulty problems, underlying reasons and resolution measures. Meanwhile, the lunar surface environment is fraught with uncertainties and unforeseen situations, such as sensor failures, unknown obstacles, and dynamic changes [10,11,12]. Furthermore, astronauts may need to monitor multiple robots or manage several tasks simultaneously, facing challenges such as diverse state information, varying operation modes, and a constantly changing environment [13]. The underlying issues are the cognitive and coding heterogeneity between humans and robots, making it difficult for astronauts in a weightless, blood-redistributed state to efficiently process the real-time complex situation and obtain the required targeting information [14]. Therefore, new dialogue methods need to be developed to meet astronauts’ interaction needs, enabling them to efficiently and accurately retrieve information on faulty nodes and assist in generating appropriate measures.

Advancements in KGs and LLMs have significantly improved the question-and-answer (Q&A) capabilities of AI systems [15]. LLMs, such as ChatGPT and GPT-4 from Open AI and ChatGLM from Tsinghua University, excel at responding to human commands [16,17,18]. However, LLMs, being black-box models, may generate inaccurate information and lack the ability to answer questions accurately and consistently [19]. In contrast, KGs are generally considered to generate accurate and consistent information, albeit with room for improvement in creativity and user experience [20]. The combination of LLMs and KGs has garnered attention and can be widely implemented in various industrial applications and Q&A systems.

For example, KGs can provide LLMs with external knowledge during pre-training and reasoning phases, thereby enhancing interpretability, while LLMs can facilitate data annotation and augmentation in KG construction, supporting the extraction of entities and relationships [21,22,23]. Fine-tuning LLMs using manually constructed templates may lead to uneven improvements due to discrete tokens [24]. Therefore, there is a need to develop a standardized and effective cue-word dataset leveraging KGs, particularly in lunar surface exploration, where information processing differs from typical shop-floor machining or assembly processes [25,26]. This involves effective supervision and interaction between human and autonomous/semi-autonomous systems, necessitating a focus on information such as human SA and agent transparency [27,28].

Based on the abovementioned issues, this paper proposes a KG-enhanced, specialized LLM to assist astronauts in localizing and addressing mission faults. First, knowledge is extracted from mission documents and expert interviews to construct the KG. Also, a user request analysis process is established using the BiLSTM + CRF model, which generates a multi-round dialog dataset tailored for lunar exploration. Next, the dataset is augmented using a p-tuning approach, which enhances the expertise of the base LLM for lunar surface exploration, producing specialized LLMs for this context. Finally, three types of prompts based on SA theory are developed, enabling fault localization and measure generation through natural language interaction with astronauts.

The rest of the paper is structured as follows. Section 2 presents a review of the related literature. Section 3 describes a methodology for embedding a KG for lunar surface exploration into a base LLM, enhancing the LLM to assist astronauts in resolving obstacles during lunar exploration. Section 4 studies and analyzes a case study. Section 5 summarizes this study and provides an outlook for future work.

2. Related Work

This section provides an overview of related work from three perspectives: (1) human–robot collaborative design and failure recovery in lunar surface exploration, discussing the necessity of dialog system design for lunar exploration missions; (2) joint KG and LLM systems, providing an overview of current developments in dialog system technology; and (3) human–robot interface design in collaboration with intelligent agents, exploring how to enhance astronaut interaction after the dialog system is implemented.

2.1. Human–Robot Collaborative Design and Failure Recovery in Lunar Surface Exploration

In recent years, lunar surface exploration missions have achieved significant results, but they remain in the exploratory phase [12,29]. China’s Chang’e-4 mission successfully targeted the far side of the Moon and deployed Yutu-2, a teleoperated rover, to explore the interior of the Von Kármán crater, providing valuable data on weathered layers, craters, and rocks [30]. NASA’s Perseverance rover, designed to operate autonomously on Mars, is equipped with a robotic arm that has set several planetary rover records [31,32]. However, rovers are prone to deviating from their pre-determined routes or encountering unknown and impassable complex terrain on their routes (e.g., deep sands, debris piles) and need to identify and address potential risks in unknown areas. However, rovers are prone to deviating from their pre-determined routes or encountering unknown and impassable terrains, such as deep sands or debris piles, and need to identify and address potential risks in these uncharted areas.

Lamarre et al. proposed a method for computing a recovery strategy that maximizes the survival probability of a solar rover under various startup conditions [33]. Vanegas et al. introduced a lunar surface emergency system that includes an Environmental Control and Life Support System (ECLSS) on the rover, featuring a small inflatable shelter around the passenger seat that can be inflated to protect astronauts when needed [34]. Hirabayashi et al. evaluated three dustproofing techniques for in situ lunar exploration based on six operational factors [35]. However, these systems primarily focus on successfully executing complex tasks or recovering partially faulty nodes in the robot, without providing methods for diagnosing faulty nodes and generating corrective measures throughout the entire mission cycle.

On the other hand, some scholars have attempted to accomplish lunar surface exploration missions through more autonomic methods, such as failure-safe motion planning and mesh topology communication networks [36,37]. Instead, we are more concerned that autonomic research will lead to a higher workload and negative emotions for astronauts. This is because astronauts must address numerous open-ended scientific questions, and the operation of the human–robot loop system is prone to deviations from the expected plan, while the mission requires astronauts to complete tasks efficiently and safely [38]. LLMs, as auxiliary systems, can offer effective dialogue solutions in contexts involving complexity, rapid decision-making, and safety requirements [39]. Therefore, we propose investigating how LLMs can assist astronauts in quickly and accurately localizing faults throughout the entire mission cycle and generating corresponding corrective measures through an intelligent Q&A system.

2.2. Joint KG and LLM Systems

In Section 2.1, intelligent Q&A systems exhibit advantages in assisting astronauts with lunar surface exploration tasks. Three common technical approaches for Q&A systems include KGs, LLMs, and their combination. KG-based systems often involve sequence annotation tasks, such as using BiLSTM + CRF for named entity recognition (NER) tasks, and Q&A through a KG to enhance the semantic parsing accuracy and interpretability of natural language questions [40,41]. Although KGs offer structured knowledge with strong interpretability in Q&A systems, they face challenges such as high construction costs, data incompleteness, and limited natural language processing (NLP) capabilities. In contrast, LLMs are typically trained on general-purpose domain corpora (e.g., the Wikipedia corpus) and encounter limitations in specialized applications like lunar surface exploration. While LLMs can understand and generate natural language, their knowledge is parameterized as implicit knowledge, which may lead to fabricated facts and lacks interpretability [19].

The cross-domain Q&A capability of LLMs can be enhanced by fine-tuning with Q&A data generated from KGs [42]. Li et al. introduced a method for construction plan compliance checking that integrates KGs and LLMs [43]. Chen et al. developed the “Enhancing Emergency Decision-Making with KG and LLM” system, which offers evidence-based decision support during various emergency response phases [44]. Zhu et al. proposed an LLM–KG joint framework that can interact with geographic information systems to enhance public awareness of flood risks [45]. Zhou et al. presented a system based on industrial structure causal knowledge augmentation of LLMs for analyzing the causes of manufacturing quality defects in aerospace products [21]. However, there is a notable gap in current research concerning the ergonomics of KG–LLM integration, particularly in lunar surface exploration, where the design of Q&A systems in collaboration with intelligent agents must prioritize professionalism and reliability.

2.3. Human–Robot Interface Design in Collaboration with Intelligent Agents

Collaboration between humans and intelligent agents presents significant challenges for effective supervision and interaction, as human SA capabilities decrease with the robot’s increasing intelligence level, resulting in a lower likelihood of successfully taking over a task when needed [27]. This automation conundrum has suggested interventions in the design of human–robot interaction interfaces, especially in human–robot dialogues. Jiang et al. investigated the impact of three types of post hoc explanations (alternative advice, prediction confidence scores, and prediction rationale) on two context-specific user decision outcomes (AI advice acceptance and advice adoption) in human–robot dialogues [46]. Berzuk et al. proposed a framework for describing human–robot interaction dialogues, collecting data from 75 published systems to distill the extensive work into the key underlying factors, contained in design and implementation aspects, and six dimensions within these [47]. Sakai et al. developed a dialog system that leverages implicit bot proponents and conducted field experiments to assess the impact of the meaning and number of robots [48]. Chen et al. developed a Situation awareness-based Agent Transparency (SAT) model to support human operators’ understanding of the mission, by enhancing their SA through collaboration with intelligent agents [49]. Therefore, there is a need to consider how the interface information of a human–robot dialog system can be designed to enhance astronauts’ SA and trust in the system during lunar surface exploration.

Prompt design may provide a technological pathway for integrating KGs with LLMs to assist astronauts in diagnosing and making decisions about faulty nodes. Prompts can better understand and generate human language without requiring specialized model training for each task. Through carefully crafted prompts, LLMs can be encouraged to generate more professional and accurate texts, stimulating their creativity and imagination [50]. Specifically, prompts can be used to design multi-level prompt controls in conjunction with SA theory when using LLMs to parse faulty nodes from multiple levels. This approach not only enhances the performance of LLMs, but also helps astronauts understand the reasoning behind intelligent agents’ decisions, facilitating more intuitive and user-friendly interactions. However, current research on the integration of KGs and LLMs, to the best of our knowledge, has rarely considered this from the perspective of human–intelligent-agent interaction, particularly in lunar exploration missions.

In summary, the above method makes it possible to integrate KGs and LLMs for fine-tuning and prompt design guided by SA theory to support human–robot dialogues in lunar surface exploration. In addition, it provides a feasible solution for faulty node localization and decision-making by tightly integrating human–robot collaboration expertise into a specialized LLM for lunar surface exploration.

3. Methodology of KG-Enhanced LLM for Lunar Surface Exploration

The framework of this paper, shown in Figure 2, is divided into three main stages. First, KG-based p-tuning dataset construction. This involves performing NER on interrogative sentences using the BiLSTM + CRF method and generating the required dataset by embedding subgraphs derived from user request analysis. Second, KG enhances LLM’s p-tuning approach using a p-tuning dataset. Using ChatGLM-6B as the base model, a KG-guided p-tuning strategy is implemented to develop a specialized offline LLM. Third, multi-level prompt design for dialogue. Informed by SA theory, this step involves designing multi-level prompts to collaboratively assist astronauts in identifying fault issues, diagnosing causes, and proposing improvement measures with high accuracy and low cognitive load. Each stage of the study is discussed in detail in subsequent sections.

3.1. KG-Based P-Tuning Dataset Construction

During the collaborative execution of lunar surface exploration tasks, most potential faults are documented in problem investigation sheets and quality inspection reports. These documents contain extensive mission-related data, including various types of failure issues, reasons, and corresponding measures. However, when faults occur, astronauts, who are often under significant cognitive load, may struggle to quickly retrieve and localize relevant information based on their own experience. To address this, decision-making can be enhanced by fine-tuning a base LLM to create a domain-specific LLM. The fine-tuned dataset is generated by KGs, constructed using the documents and by conducting expert interviews. Since the general knowledge inherent to large models cannot be effectively augmented using only KG subgraph triples, the dataset is further refined by embedding subgraphs derived from the user request analysis process. These embeddings are utilized as prefixes to optimize each layer of the LLM. The process of user request analysis, which considers document types and categorizations, is illustrated in Figure 3.

User request analysis comprises five modules. The first module, i.e., question, records various interrogative sentences that users may ask. Next, the questions undergo NER and question classification sequentially. The NER module, based on a BiLSTM + CRF model, identifies key entities within the interrogative sentence, while the question classification module determines the question type based on these keywords. Subsequently, the extracted keywords and question types are processed in the question parsing module, which interacts with the KG to retrieve the relevant subgraph and embed it into an answer template. Finally, the system generates and outputs answers enriched by the subgraphs.

Next, we provide a detailed analysis of how the three intermediate modules between questions and answers influence model performance. We begin by introducing the NER model, followed by the question classification and question parsing models. The NER model is illustrated in Figure 3. The input consists of eight characters,

x_{1}, x_{2}, \dots x_{8}

, each associated with a corresponding label. The goal of training optimization is to assign a predicted label to each word and minimize the difference between the predicted and true labels. First, the embedding of each word is processed as an input. The model then employs a bidirectional LSTM to extract the semantic representation vector of each word within its context. Finally, a CRF layer decodes these semantic vectors, enabling the multi-class classification of each word to identify recognized entities and their categories.

The Bi-directional Long Short-Term Memory (BiLSTM) model combines a forward LSTM and a backward LSTM, enabling more effective processing of text information. A single LSTM model comprises a forget gate, an input gate, and an output gate. Its mechanism operates as follows: the inputs include the hidden state

h_{(t - 1)}

from the previous time step, the memory cell

c_{(t - 1)}

from the previous time step, and the current input

x_{(t)}

. The forget gate determines which information in

c_{(t - 1)}

should be discarded based on its calculations:

f_{(t)} = s i g m o i d (W_{f} [h_{(t - 1)} {, x}_{(t)}] + b_{f})

(1)

The information updated by the candidate memory cell

{\tilde{c}}_{(t)}

is determined through the computation of the input gate, while the tanh function is responsible for the creation of a new information candidate vector.

i_{(t)} = s i g m o i d (W_{i} [h_{(t - 1)}, x_{(t)}] + b_{i})

(2)

{\tilde{c}}_{(t)} = t a n h (W_{c} [h_{(t - 1)}, x_{(t)}] + b_{c})

(3)

The memory cell

c_{(t)}

is updated by summing the above parameters; finally,

o_{(t)}

is obtained through the output gate and multiplied by

c_{(t)}

to obtain the final output value

h_{(t)}

.

c_{(t)} = f_{(t)} * c_{(t - 1)} + i_{(t)} * {\tilde{c}}_{(t)}

(4)

o_{(t)} = s i g m o i d (W_{o} [h_{(t - 1)}, x_{(t)}] + b_{o})

(5)

h_{(t)} = t a n h (c_{(t)}) * o_{(t)}

(6)

The probability matrix

S_{n * k}

output by the BiLSTM model may result in invalid label sequences. In contrast, the Conditional Random Field (CRF) model maintains a probability transition matrix

M

, where

M_{y_{i - 1}, y_{i}}

represents the scores for transitioning from label

y_{i - 1}

to label

y_{i}

. Unlike the common softmax function, the CRF constrains label prediction during decoding based on this transition matrix, outputting the label sequence

\hat{y}

with the highest score as the final labeling result.

S c o r e (x, y) = \sum_{i = 1}^{n} M_{y_{i - 1}, y_{i}} + \sum_{i = 1}^{n} S_{i, y_{i}}

(7)

\hat{y} = a r g m a x S c o r e (x, y)

(8)

The question classification model is a text categorization approach based on keyword extraction and matching. Keywords are first extracted from the question, and then the question is categorized into an appropriate class based on the association between these keywords and predefined categories. Question classification is essential because NER alone cannot uniquely determine question types. For example, the relationship between a task and an interaction device may fall into categories such as “recommended”, “available”, or “caution”. While NER can identify device entities and their associated elements, it cannot determine their relationship. Additionally, due to a significant category imbalance in the corpus, where some question categories are overrepresented, a text categorization model is not used. Instead, heuristic keyword-based rules are employed for classification.

The question parsing model is an approach used to understand and analyze user intent by combining NER and question classification. It analyzes the question into structured semantic information to explicitly generate knowledge subgraphs, which are then embedded to produce answers. Subgraphs corresponding to user queries are identified by constructing precise query statements and generating corresponding SQL statements.

Specifically, the BiLSTM + CRF information extraction model, combined with the question classification function, generates question types and entities. Next, the question parsing function is used to construct subgraph-embedded SQL statements. Finally, the corresponding answers are generated based on the question type templates. The specific algorithm flow is shown in Algorithm 1.

Algorithm 1: Question and answer modeling with knowledge subgraph embedding

Input: The question q = [x1, x2, …, xn] denotes a question consisting of n characters.
Output: Obtain answer A, which is generated by embedding through a knowledge subgraph.

Initialize Question Classifier, Question Parser, and Answer Searcher functions and load feature words to build the feature word list.
NER the question and record the entity X1 and its type Y1 identified in the query.
if the question contains the feature word and Y1 in the corresponding type,
question_type is categorized as the corresponding question type T1,
different keywords and types are merged to form dictionary data [Ti], i ∈ {0, 1, …, n}.
for question_type in [Ti]
if question_type == T_i,
Get the corresponding query statement sql for embedding the KG.
Merge to form dictionary data [sql_j], j ∈ {0, 1, …, n}.
for sql in [sql_j]
if T_i == sql_j,
Generate subgraphs by associating each entity X_i and type Y_i according to sql_j.
Embed subgraphs into the question type templates to generate the answer A.
print (A)
end

3.2. KG Enhances LLM’s P-Tuning Approach

To better assist astronauts in addressing issues arising during human–robot task execution, an offline specialized LLM can be designed. Using ChatGLM-6B, a dialogue model tailored for Chinese scenarios, as the base model, and the dataset constructed from the KG in Section 3.1 as the external corpus, the model is fine-tuned using the p-tuning v2 method [51]. By freezing the parameters ∅ of the base model in the gray section and training the prefix tuning parameters θ in the orange section, the model’s knowledge inference performance is significantly improved with no more than 3% of the base model’s parameters, as shown in Figure 4.

Specifically, the model input after adding the prefix is denoted as

z = [P R E F I X; a; {P R E F I X}^{’}; b]

, where

{P R E F I X}_{i d x}

,

A_{i d x}

, and

B_{i d x}

represent the idx positions of

P R E F I X

,

a

, and

b

, respectively. A matrix

P_{θ} \in R^{|P_{i d x}| * d i m (C_{n})}

storing the prefix parameters is also constructed, where

P_{i d x}

denotes the ordinal number of the prefix index,

| P_{i d x} |

denotes the length of the prefix, and

C_{n}

denotes the output of the

n

th hidden layer. This allows fine-tuning of the dialogue model’s output by defining trainable prompt embeddings, which are passed to the transformer along with the input text’s word vectors for computation. The output computation is shown below.

C_{n} = {\begin{matrix} P_{θ} [n, :], i f n ϵ P_{i d x} \\ {L M}_{\emptyset} (z_{n}, C_{< n}), o t h e r w i s e \end{matrix}

(9)

where

C_{< n}

denotes the hidden layer before

C_{n}

. Further, the training objective function fine-tuned for the pre-tuning parameter

θ

is as follows:

\max_{\emptyset} l o g P_{θ, \emptyset} (b | a) = \sum_{n ϵ B_{i d x}} \log P_{θ, \emptyset} (z_{n} | C_{< n}),

(10)

where

P_{θ, \emptyset}

is the trainable probability distribution of the model. This computation allows the model to train cue parameters based on the task loss

θ

without changing the original parameters. By optimizing no more than 3% of the parameters, the model is applicable to lunar surface exploration tasks without modifying the model architecture. Subsequently, while directly updating

P_{θ, \emptyset}

may lead to optimization instability and performance degradation, the results are obtained through the reparameterization matrix

P_{θ} [n, :]

=

{M L P}_{θ} ({P ’}_{θ} [n, :])

, while retaining only

P_{θ, \emptyset}

[52].

Using the above method, the parameters of the prefixed part can be designed as the matrix to be trained, and fine-tuning can be realized after training and re-prefixing to the base model. This is possible because the parameters of the base model are frozen during the training process without changing its parameters. The fine-tuning method using the KG-enhanced LLM is also realized to form a specialized LLM for lunar surface exploration.

3.3. Multi-Level Prompt Design for Dialogue

Although LLMs have been trained on vast amounts of textual data and demonstrate strong language generation capabilities, efficiently and accurately controlling model outputs remains a challenge. Prompts, which, in this paper, include both instruction and input text information, can guide the results generated by LLMs through the design of high-quality textual prompts. This can significantly enhance the relevance, consistency, and controllability of the model output. In human–robot dialogue, prompts can greatly improve the efficiency and quality of interactions. As shown in Figure 5, through clear and specific instructions, prompts can guide the specialized LLM to generate more accurate and relevant responses, reducing unnecessary multiple rounds of dialogue, thereby decreasing astronaut workload and improving efficiency.

Specifically, the prompt design is structured into three levels, aligning with the three levels of SA theory: “real-time data”, “task analysis”, and “measure recommendation”. SA level 1, the perception layer, focuses on identifying and acquiring key environmental information, corresponding to the system state (fault condition) in prompt I. SA level 2, the comprehension layer, involves interpreting the meaning of this data, which aligns with the underlying cause of the fault in prompt II. SA level 3, the projection layer, predicts the current situation’s possible or probable future state, corresponding to the potential measures for the fault in prompt III [53]. The corresponding instruction messages are carefully crafted as follows: “What are the possible problems of the current task?”, “What are the possible reasons behind the current problem?”, and “What are the possible measures to solve the current problem?”

In the “real-time data” phase, the astronaut can input the mission node that is currently malfunctioning and select prompt I to guide the LLM in accurately generating the possible problems within the mission. In the “task analysis” phase, the astronaut can enter the possible problems identified in the previous round of dialogue and select prompt II to guide the LLM in identifying the potential reasons behind those problems. In the “measure recommendation” phase, the astronaut can input the possible reasons from the previous dialogue and select prompt III to help the LLM recommend potential measures. These three consecutive rounds of questioning assist the astronauts in identifying the possible problems, reasons, and measures related to the current mission obstacle.

Predefined instruction information is optional, as astronauts may choose to customize the input text for specific malfunctions, or parameters provided by sensors may assist in the decision-making process. For example, during the questioning process, if the problem involves identifying multiple targets, astronauts can draw conclusions directly from the camera feed.

4. Case Study

Lunar surface exploration missions involve a variety of activities, aimed at collecting data to gain a deeper understanding of the Moon’s geology, environment, and other characteristics. The mission environment is harsh, and failure scenarios are often complex. As a result, effective dialogue with astronauts is essential for diagnosing problems, identifying causes, and determining appropriate measures. This includes critical information needed to decide on the final solution, such as team composition, degree of synergy, and other factors. In this study, we use a lunar surface exploration mission as a case study to evaluate the feasibility of the proposed model.

4.1. Experimental Setting

The development environment for this study is Ubuntu 20.04.2, with an Intel^® Xeon^® Gold 6226R CPU @ 2.90GHz, 128 GB of RAM, and an NVIDIA GeForce RTX 4090 24GB. We used Python 3.9.16 and PyTorch 2.0.0 for development, and the Neo4j graph database 5.22.0 to store KG entities and relationships, building a navigable KG. Drawing on the five key elements of “human–machine–material–method–environment” and the specific characteristics of lunar surface missions, we identified 12 types of entities: task, device, goal, human, robot, infrastructure, terrain, problem, measure, reason, synergy, and team formation. Additionally, we defined 13 types of relationships among these entities, including follow, define, recommend, available, caution, need, located, contain, correspond, related, complete, cause, and solve.

Based on the data construction methodology outlined in Section 3.1, we created over 4000 high-quality training datasets using KGs to fine-tune the base model. Figure 6 provides an example of data cues related to human–robot collaboration in lunar surface exploration. The training dataset consists of three parts: “question”, “response”, and “history”. Specifically, the “question” provides the relevant instructions for the query, highlighted in red. The “response” is the answer to the query, which corresponds to the knowledge triples shown in blue. The “history” includes the chat history, enabling the model to be fine-tuned using data from multiple rounds of conversations. The p-tuning training process employs the Adam optimizer, with a maximum input token length of 512 and a maximum output token length of 256, using a learning rate of 2e-2. For the NER model, task documents are used as training data, and the training process employs the Adam optimizer, with the LSTM hidden sizes set to 128. The model is trained for 50 epochs using a batch size of 32, with a learning rate set to 0.001, and a dropout rate of 0.5 is applied to prevent overfitting.

While ChatGPT and GPT-4 have demonstrated a strong performance on many tasks, a significant amount of mission data from lunar surface exploration is subject to strict confidentiality requirements. Due to this limitation, we are unable to provide these models with failure-related data. Furthermore, the nature of the collected task documents, which are fault-related and consist of unique question–answer pairs, makes it inappropriate to randomly divide the dataset into training and test sets for evaluations such as BLEU-4 and ROUGE-1, due to the non-repetitive nature of the data. Additionally, both ChatGPT and GPT-4 are proprietary, non-open-source models, making it impossible for us to quantitatively assess them. Allowing free-form questions would complicate the standardization of assessment criteria and make it difficult to ensure that the difficulty level of questions asked by different experts is consistent. For a fair comparison, we used the responses to the same specific questions as benchmarks to evaluate the selected models, including ChatGPT, GPT-4, ChatGLM-6B, and the proposed LLM.

We invited human–robot collaboration professionals, who were not involved in KG generation, to rank the outputs of each model in response to 30 fault-related questions. The 30 questions were selected by experts involved in the KG generation, with the selection criteria based on the design of the three levels of the prompt. Ten representative questions were carefully chosen for each level to ensure they were typical, relevant, and comprehensive in covering the key failure types. Models were scored on a scale from 1 to 10 for their responses to each question, with a maximum total score of 300. The output selected the best of three rounds. The criteria for assessing the 30 questions are as follows:

Correctness: does the model’s response align with known correct answers or expert knowledge?
Completeness: are there any factual errors? Does the response cover all key points of the question without omitting important information?
Relevance: is the response highly relevant to the question? Does it avoid irrelevant or redundant information?
Clarity: is the response clear and fluent? Does it conform to the reading habits of professionals?
Actionability (for measure recommendations): are the provided solutions feasible and actionable for actual tasks?

Pre-experimental training was conducted, where professionals were provided with examples to illustrate the principles and applications of the dialogue model prompt design. The scoring criteria, as described above, were clearly communicated to the participants, who were instructed on how to score the model’s responses based on these criteria.

4.2. Experimental Results and Analysis

A comparison of the experimental results is shown in Figure 7. Each bar represents the evaluation score of a model, with the score expressed as a percentage. This demonstrates the effectiveness of constructing high-quality Q&A datasets using KGs and fine-tuning with the p-tuning approach. The commercial large-scale language models (ChatGPT and GPT-4) also perform well in addressing lunar surface exploration-related questions, achieving performance only 10.4% and 7.4% lower than that of the proposed model. These results suggest that commercial models can also contribute to solving failure-related issues during lunar surface exploration, to some extent.

To further analyze the differences among the models in terms of answering fault-related questions, specific examples of Q&A are provided in Table 1 to visualize the details. On the left are the questions entered in Chinese, along with the answers generated by the four models. The right side contains the English translations. All responses are based on actual tests of the four models. It can be observed that the models demonstrate varying degrees of expertise in answering the questions, aligning with the evaluation of the different models for fault-related questions presented in Figure 7.

Nevertheless, while ChatGPT and GPT-4 can address specific questions related to lunar surface exploration, their answers tend to be more general. At times, they suggest ideas for improvement that may not be applicable in the current scenario. For example, they might mention designing a new navigation algorithm, which is not a feasible solution at the moment but could be implemented at the end of the mission. That said, the responses can be used not only to answer the lunar exploration-related questions, but can also be applied to other related failure problems. In addition to redundancy, issues arise in parsing certain keywords, such as distinguishing between consequences and reasons, where general LLMs may struggle to differentiate effectively. This challenge may stem from the domain knowledge embedded in the KG and the construction process, which involved experts familiar with SA theory. A general model, however, may not fully activate the relevant dialogue potential through prompt design alone.

In contrast, compared to ChatGPT and GPT-4, the proposed model provides answers that are more suited to the current scenario and highly relevant to the human–robot collaboration challenges in lunar surface exploration. This is because it focuses more on specific problems, reasons, measures, and other critical metrics, information that can be derived from the KG. Such information is not readily available in ChatGPT or GPT-4. Therefore, we believe that incorporating relevant task data into the KG for lunar surface exploration and designing a fine-tuning model through p-tuning can enhance the ability of LLMs to effectively address problem diagnosis, reason analysis, and measure generation in lunar surface exploration tasks.

4.3. Usability and Feasibility Analysis

To illustrate the usability of the proposed model, Figure 8 demonstrates its application in generating answers. The nature of lunar surface exploration missions compels astronauts to confront challenges across multiple domains, while the black-box nature of LLMs may lead to distrust in the results generated. Therefore, there is a need to provide multiple levels of information about the LLM to help astronauts fully understand and trust the content generated by the proposed LLM.

It is assumed that astronauts will only have a general understanding of the current mission phase when faults are detected, along with access to video streams and sensor parameters that are difficult to intuitively quantify. When a malfunction occurs, astronauts can quickly and accurately identify current tasks as “Target Recognition”. Then, possible problems are generated by entering them into the dialogue box and selecting prompt I. After selecting “Failure to find target” from the listed problems and selecting prompt II, potential reasons are generated. By selecting “Moon Dust Obstructing Camera” from the list of reasons and selecting prompt III, possible measures are provided. Different levels of progressive human–robot dialogue are employed to help astronauts understand the reasoning behind the LLM’s measure-generation process, enabling them to more clearly and intuitively grasp the reason behind the problem, rather than simply receiving recommended measures.

We confirm that the proposed model provides a more professional level of response to failure-related problems. Additionally, we assess the model’s feasibility based on time consumption. Specifically, the time required for the model to generate text is divided into three parts: the time consumption for the “Problem-related questions”, “Reason-related questions”, and “Measure-related questions”, each corresponding to 10 fault-related questions, as shown in Figure 9. It can be seen that the time spent on answer generation is around 8 to 15 s. Overall, the response times for all problem-related questions, reason-related questions, and measure-related questions are on the order of seconds, a speed that meets the information processing needs of fault events during lunar surface exploration.

5. Conclusions

In this paper, we propose a human–robot team knowledge-enhanced LLM for analyzing fault-related problems during lunar surface exploration. First, we introduce a user request analysis process to construct a fine-tuned dataset, which is generated by combining NER based on the BiLSTM + CRF and question classification, question parsing, and answer search functions with KG indexing. Next, the fine-tuned dataset is employed to p-tune the expertise of LLMs in lunar surface exploration, resulting in a specialized LLM that enhances fault analysis capabilities. Finally, we design three types of prompts across different levels of perception, reasoning, and decision-making, enabling astronauts to quickly and accurately obtain essential information about fault problems, reasons, and measures.

The case study shows that the proposed LLM provides more specialized and contextually relevant answers to specific questions during lunar surface exploration compared to commercial models such as ChatGPT and GPT-4. For faulty nodes, it effectively assists astronauts in performing feasibility analyses and making informed decisions. However, the model lacks the ability to respond to multimodal data (e.g., visual, speech) during lunar surface exploration.

While the proposed method shows better responses to self-constructed fault problems, it still faces some limitations. The analytical capabilities of the specialized LLM are strongly tied to the quality and granularity of human–robot team knowledge. Extracting more high-quality, fine-grained knowledge is crucial for further improvements. For instance, the proposed model relies on task documents for fine-tuning dataset construction, and it is possible to construct a KG at the operation-level granularity to improve the ability to analyze operational faults. Exploring multimodal data integration is another promising direction, such as implementing speech-based tool invocation or incorporating vision-language models (VLM) to improve interaction and functionality. Additionally, retraining prefix parameters is essential to maintain the expertise and reliability of the specialized LLM, particularly in more dynamic and complex conversational scenarios encountered during lunar surface exploration tasks.

Author Contributions

Conceptualization, H.W., C.W. and Y.F.; methodology, H.W., S.X. and Y.F.; software, C.W. and S.X.; validation, H.W. and H.Z.; formal analysis, H.Z. and Y.F.; investigation, H.Z.; resources, C.W. and S.X.; data curation, C.W. and Y.F.; writing—original draft preparation, H.W.; writing—review and editing, H.W., S.X. and Y.F.; visualization, H.W. and S.X.; supervision, C.W. and H.Z.; project administration, S.X. and Y.F.; funding acquisition, Y.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Laboratory of Human Factors Engineering Stable Support Fund [No: GJSD22004], for the research on Key Technology and System Implementation of Human–Robot Collaboration in Planetary Exploration.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

The authors would like to express sincere gratitude to the professionals who participated in the interview of KG construction and the evaluation of LLMs for this paper.

Conflicts of Interest

The authors declare that they have no competing interests.

References

Marov, M.Y.; Slyuta, E.N. Early steps toward the lunar base deployment: Some prospects. Acta Astronaut. 2021, 181, 28–39. [Google Scholar] [CrossRef]
Serria, E.; Gadhafi, R.; Almaeeni, S.; Mukhtar, H.; Copiaco, A.; Abd-Alhameed, R.; Lemieux, F.; Mansoor, W. A Review of Lunar Communications and Antennas: Assessing Performance in the Context of Propagation and Radiation. Sensors 2023, 23, 9832. [Google Scholar] [CrossRef] [PubMed]
Tonasso, R.; Tataru, D.; Rauch, H.; Pozsgay, V.; Pfeiffer, T.; Uythoven, E.; Rodríguez-Martínez, D. A lunar reconnaissance drone for cooperative exploration and high-resolution mapping of extreme locations. Acta Astronaut. 2024, 218, 1–17. [Google Scholar] [CrossRef]
Chien, S.A.; Visentin, G.; Basich, C. Exploring beyond Earth using space robotics. Sci. Robot. 2024, 9, eadi6424. [Google Scholar] [CrossRef]
Vaquero, T.S.; Daddi, G.; Thakker, R.; Paton, M.; Jasour, A.; Strub, M.P.; Swan, R.M.; Royce, R.; Gildner, M.; Tosi, P.; et al. EELS: Autonomous snake-like robot with task and motion planning capabilities for ice world exploration. Sci. Robot. 2024, 9, eadh8332. [Google Scholar] [CrossRef] [PubMed]
Liu, C.C.; Tang, D.B.; Zhu, H.H.; Nie, Q.W.; Chen, W.; Zhao, Z. An augmented reality-assisted interaction approach using deep reinforcement learning and cloud-edge orchestration for user-friendly robot teaching. Robot. Comput.-Integr. Manuf. 2024, 85, 102638. [Google Scholar] [CrossRef]
Morrell, B.J.; da Silva, M.S.; Kaufmanna, M.; Moon, N.; Kim, T.; Lei, X.M.; Patterson, C.; Uribe, J.; Vaquero, T.S.; Correa, G.J.; et al. Robotic exploration of Martian caves: Evaluating operational concepts through analog experiments in lava tubes. Acta Astronaut. 2024, 223, 741–758. [Google Scholar] [CrossRef]
Sidaoui, A.; Daher, N.; Asmar, D. Human-Robot Interaction via a Joint-Initiative Supervised Autonomy (JISA) Framework. J. Intell. Robot. Syst. 2022, 104, 51. [Google Scholar] [CrossRef]
Rimani, J.; Viola, N.; Lizy-Destrez, S. Simulating Operational Concepts for Autonomous Robotic Space Exploration Systems: A Framework for Early Design Validation. Aerospace 2023, 10, 408. [Google Scholar] [CrossRef]
Bi, J.H.; Jin, A.; Chen, C.; Ying, S. Enhanced Interactive Rendering for Rovers of Lunar Polar Region and Martian Surface. Remote Sens. 2024, 16, 1270. [Google Scholar] [CrossRef]
Mazarico, E.; Barker, M.K.; Jagge, A.M.; Britton, A.W.; Lawrence, S.J.; Bleacher, J.E.; Petro, N.E. Sunlit pathways between south pole sites of interest for lunar exploration. Acta Astronaut. 2023, 204, 49–57. [Google Scholar] [CrossRef]
Rimani, J.; Bucchioni, G.; Ryals, A.D.; Viola, N.; Lizy-Destrez, S. Integrated Conceptual Design and Parametric Control Assessment for a Hybrid Mobility Lunar Hopper. Aerospace 2023, 10, 669. [Google Scholar] [CrossRef]
Wright, J.L.; Lakhmani, S.G.; Chen, J.Y.C. Bidirectional Communications in Human-Agent Teaming: The Effects of Communication Style and Feedback. Int. J. Hum.-Comput. Interact. 2022, 38, 1972–1985. [Google Scholar] [CrossRef]
Karakikes, M.; Nathanael, D. The effect of cognitive workload on decision authority assignment in human-robot collaboration. Cogn. Technol. Work 2023, 25, 31–43. [Google Scholar] [CrossRef]
Liu, P.F.; Qian, L.; Zhao, X.W.; Tao, B. Joint Knowledge Graph and Large Language Model for Fault Diagnosis and Its Application in Aviation Assembly. IEEE Trans. Ind. Inform. 2024, 20, 8160–8169. [Google Scholar] [CrossRef]
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
GLM, T.; Zeng, A.; Xu, B.; Wang, B.; Zhang, C.; Yin, D.; Zhang, D.; Rojas, D.; Feng, G.; Zhao, H. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv 2024, arXiv:2406.12793. [Google Scholar]
Wang, C.; Liu, Y.Y.; Guo, T.Z.; Li, D.P.; He, T.; Li, Z.; Yang, Q.W.; Wang, H.H.; Wen, Y.Y. Systems engineering issues for industry applications of large language model. Appl. Soft. Comput. 2024, 151, 111165. [Google Scholar] [CrossRef]
Peng, C.Y.; Xia, F.; Naseriparsa, M.; Osborne, F. Knowledge Graphs: Opportunities and Challenges. Artif. Intell. Rev. 2023, 56, 13071–13102. [Google Scholar] [CrossRef]
Zhou, B.; Li, X.Y.; Liu, T.Y.; Xu, K.Z.; Liu, W.; Bao, J.S. CausalKGPT: Industrial structure causal knowledge-enhanced large language model for cause analysis of quality problems in aerospace product manufacturing. Adv. Eng. Inform. 2024, 59, 102333. [Google Scholar] [CrossRef]
Yang, L.Y.; Chen, H.Y.; Li, Z.; Ding, X.; Wu, X.D. Give us the Facts: Enhancing Large Language Models With Knowledge Graphs for Fact-Aware Language Modeling. IEEE Trans. Knowl. Data Eng. 2024, 36, 3091–3110. [Google Scholar] [CrossRef]
Pan, S.R.; Luo, L.H.; Wang, Y.F.; Chen, C.; Wang, J.P.; Wu, X.D. Unifying Large Language Models and Knowledge Graphs: A Roadmap. IEEE Trans. Knowl. Data Eng. 2024, 36, 3580–3599. [Google Scholar] [CrossRef]
Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT understands, too. AI Open 2024, 5, 208–215. [Google Scholar] [CrossRef]
Chen, L.; Xu, J.H.; Wu, T.Y.; Liu, J. Information Extraction of Aviation Accident Causation Knowledge Graph: An LLM-Based Approach. Electronics 2024, 13, 3936. [Google Scholar] [CrossRef]
Zhou, B.; Bao, J.S.; Chen, Z.Y.; Liu, Y.H. KGAssembly: Knowledge graph-driven assembly process generation and evaluation for complex components. Int. J. Comput. Integr. Manuf. 2022, 35, 1151–1171. [Google Scholar] [CrossRef]
Endsley, M.R. From Here to Autonomy: Lessons Learned From Human-Automation Research. Hum. Factors 2017, 59, 5–27. [Google Scholar] [CrossRef]
Chen, J.Y.C.; Barnes, M.J.; Wright, J.L.; Stowers, K.; Lakhmani, S.G. Situation awareness-based agent transparency for human-autonomy teaming effectiveness. In Proceedings of the Conference on Micro- and Nanotechnology (MNT) Sensors, Systems, and Applications IX, Anaheim, CA, USA, 9–13 April 2017. [Google Scholar]
Zhang, T.; Pang, Y.; Zeng, T.; Wang, G.X.; Yin, S.; Xu, K.; Mo, G.D.; Zhang, X.W.; Wang, L.S.; Yang, S.; et al. Robotic drilling for the Chinese Chang’E 5 lunar sample-return mission. Int. J. Robot. Res. 2023, 42, 586–613. [Google Scholar] [CrossRef]
Ding, L.; Zhou, R.; Yuan, Y.; Yang, H.; Li, J.; Yu, T.; Liu, C.; Wang, J.; Li, S.; Gao, H.; et al. A 2-year locomotive exploration and scientific investigation of the lunar farside by the Yutu-2 rover. Sci. Robot. 2022, 7, eabj6660. [Google Scholar] [CrossRef]
Verma, V.; Maimone, M.W.; Gaines, D.M.; Francis, R.; Estlin, T.A.; Kuhn, S.R.; Rabideau, G.R.; Chien, S.A.; McHenry, M.M.; Graser, E.J.; et al. Autonomous robotics is driving Perseverance rover’s progress on Mars. Sci. Robot. 2023, 8, eadi3099. [Google Scholar] [CrossRef]
Fries, M.D.; Lee, C.; Bhartia, R.; Hollis, J.R.; Beegle, L.W.; Uckert, K.; Graff, T.G.; Abbey, W.; Bailey, Z.; Berger, E.L.; et al. The SHERLOC Calibration Target on the Mars 2020 Perseverance Rover: Design, Operations, Outreach, and Future Human Exploration Functions. Space Sci. Rev. 2022, 218, 46. [Google Scholar] [CrossRef]
Lamarre, O.; Malhotra, S.; Kelly, J. Recovery policies for safe exploration of lunar permanently shadowed regions by a solar-powered rover. Acta Astronaut. 2023, 213, 708–724. [Google Scholar] [CrossRef]
Vanegas, M.; Kotowick, K.; LaTour, P.; Curry, M.; Foley, J.; Hoffman, J.; Schreiner, S.; Setterfield, T.; Geiger, L.; Barmore, D.; et al. An Emergency Mitigation System for Safer Lunar Surface Exploration. IEEE Aerosp. Electron. Syst. Mag. 2019, 34, 42–52. [Google Scholar] [CrossRef]
Hirabayashi, M.; Hartzell, C.M.; Bellan, P.M.; Bodewits, D.; Delzanno, G.L.; Hyde, T.; Konopka, U.; Thomas, E.; Thomas, H.M.; Hahn, I.; et al. Electrostatic dust remediation for future exploration of the Moon. Acta Astronaut. 2023, 207, 392–402. [Google Scholar] [CrossRef]
Park, B.J.; Chung, H.J. Deep Reinforcement Learning-Based Failure-Safe Motion Planning for a 4-Wheeled 2-Steering Lunar Rover. Aerospace 2023, 10, 219. [Google Scholar] [CrossRef]
Fink, W.; Fuhrman, C.; Zuniga, A.N.; Tarbell, M. A Hansel & Gretel breadcrumb-style dynamically deployed communication network paradigm using mesh topology for planetary subsurface exploration. Adv. Space Res. 2023, 72, 518–528. [Google Scholar] [CrossRef]
Cilliers, J.; Hadler, K.; Rasera, J. Toward the utilisation of resources in space: Knowledge gaps, open questions, and priorities. npj Microgravity 2023, 9, 22. [Google Scholar] [CrossRef] [PubMed]
Wan, P.X.; Huang, Z.G.; Tang, W.J.; Nie, Y.L.; Pei, D.J.; Deng, S.F.; Chen, J.; Zhou, Y.Z.; Duan, H.R.; Chen, Q.Y.; et al. Outpatient reception via collaboration between nurses and a large language model: A randomized controlled trial. Nat. Med. 2024, 30, 2878–2885. [Google Scholar] [CrossRef]
Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991. [Google Scholar]
Li, J.C.; Luo, X.D.; Lu, G.Q. GS-CBR-KBQA: Graph-structured case-based reasoning for knowledge base question answering. Expert Syst. Appl. 2024, 257, 125090. [Google Scholar] [CrossRef]
Buehler, M.J. MechGPT, a Language-Based Strategy for Mechanics and Materials Modeling That Connects Knowledge Across Scales, Disciplines, and Modalities. Appl. Mech. Rev. 2024, 76, 021001. [Google Scholar] [CrossRef]
Li, H.; Yang, R.Z.; Xu, S.S.; Xiao, Y.; Zhao, H.Y. Intelligent Checking Method for Construction Schemes via Fusion of Knowledge Graph and Large Language Models. Buildings 2024, 14, 2502. [Google Scholar] [CrossRef]
Chen, M.Z.; Tao, Z.X.; Tang, W.T.; Qin, T.X.; Yang, R.; Zhu, C.L. Enhancing emergency decision-making with knowledge graphs and large language models. Int. J. Disaster Risk Reduct. 2024, 113, 104804. [Google Scholar] [CrossRef]
Zhu, J.; Dang, P.; Cao, Y.G.; Lai, J.B.; Guo, Y.K.; Wang, P.; Li, W.L. A flood knowledge-constrained large language model interactable with GIS: Enhancing public risk perception of floods. Int. J. Geogr. Inf. Sci. 2024, 38, 603–625. [Google Scholar] [CrossRef]
Jiang, J.; Kahai, S.; Yang, M. Who needs explanation and when? Juggling explainable AI and user epistemic uncertainty. Int. J. Hum.-Comput. Stud. 2022, 165, 102839. [Google Scholar] [CrossRef]
Berzuk, J.M.; Young, J.E. More than words: A Framework for Describing Human-Robot Dialog Designs. In Proceedings of the 17th Annual ACM/IEEE International Conference on Human-Robot Interaction (HRI), Sapporo, Japan, 7–10 March 2022; pp. 393–401. [Google Scholar]
Sakai, K.; Ban, M.; Mitsuno, S.; Ishiguro, H.; Yoshikawa, Y. Leveraging the Presence of Other Robots to Promote Acceptability of Robot Persuasion: A Field Experiment. IEEE Robot. Autom. Lett. 2024, 9, 9813–9819. [Google Scholar] [CrossRef]
Chen, J.Y.; Lakhmani, S.G.; Stowers, K.; Selkowitz, A.R.; Wright, J.L.; Barnes, M. Situation awareness-based agent transparency and human-autonomy teaming effectiveness. Theor. Issues Ergon. Sci. 2018, 19, 259–282. [Google Scholar] [CrossRef]
Hu, Z.J.; Yang, P.; Jiang, Y.S.; Bai, Z.J. Prompting large language model with context and pre-answer for knowledge-based VQA. Pattern Recognit. 2024, 151, 110399. [Google Scholar] [CrossRef]
Liu, X.; Ji, K.; Fu, Y.; Tam, W.L.; Du, Z.; Yang, Z.; Tang, J. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv 2021, arXiv:2110.07602. [Google Scholar]
Li, X.L.S.; Liang, P.; Assoc Computat, L. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the Joint Conference of 59th Annual Meeting of the Association-for-Computational-Linguistics (ACL)/11th International Joint Conference on Natural Language Processing (IJCNLP)/6th Workshop on Representation Learning for NLP (RepL4NLP), Online, 1–6 August 2021; pp. 4582–4597. [Google Scholar]
Endsley, M.R. Supporting Human-AI Teams:Transparency, explainability, and situation awareness. Comput. Hum. Behav. 2023, 140, 107574. [Google Scholar] [CrossRef]

Figure 1. Typical mission process for human–robot collaboration in lunar surface exploration.

Figure 2. A framework for an MIKG-enhanced LLM human–robot dialogue in lunar surface exploration.

Figure 3. User request analysis process.

Figure 4. KG enhances LLM’s p-tuning approach.

Figure 5. Multi-level prompt design schematic.

Figure 6. Example of a p-tuning dataset generated using KG.

Figure 7. Experimental scoring results figure.

Figure 8. Application examples of the proposed prompt design.

Figure 9. Response time consumption figure.

Table 1. Examples of Q&A generated by the four LLM models.

Question:	Example (in Chinese)	Example(in Corresponding English)
Question:	作为专家系统，分析人机团队执行装配目标识别失效时可能的问题？	As an expert system, analyze the possible problems that occur when human–robot teams fail to perform assembly target identification.
GPT-4	人机团队执行装配时，若目标识别失效，可能导致以下问题：装配错误：无法准确识别目标，导致装配错误，影响产品质量。进度延误：识别失效需要重复识别或人工干预，延长装配时间。质量问题：错误识别可能导致装配不合格，增加返工或报废率。资源浪费：识别失效增加时间和物料浪费，降低效率。安全风险：错误识别可能引发碰撞或错误装配，带来安全隐患。协同问题：机器人无法向操作员提供准确信息，影响团队协作。这些问题会影响装配质量、效率和安全性，甚至导致任务失败。	Failure of target recognition when the human–robot teams perform assembly can lead to the following problems: Assembly errors: failure to accurately identify the target leads to assembly errors, affecting product quality. Progress delays: failed identification requires repeated identification or manual intervention, extending assembly time. Quality issues: incorrect identification can lead to failed assemblies, increasing rework or scrap rates. Wasted resources: recognizing failures increases time and material waste and reduces efficiency. Safety risks: misidentification may lead to collision or incorrect assembly, posing a safety risk. Collaboration issues: robots are unable to provide accurate information to operators, affecting teamwork. These issues can affect assembly quality, efficiency and safety, and even lead to mission failure.
ChatGPT (GPT-3.5)	当人机团队执行装配目标识别失效时，可能会遇到以下问题：传感器精度不足：传感器可能未能提供足够的细节或准确性，导致目标识别失败。环境干扰：光线变化、遮挡物或杂乱的背景可能干扰目标识别系统。算法误差：目标识别算法可能未能准确识别目标，或存在误识别、漏识别的情况。目标定位错误：目标的定位信息可能不准确，导致后续装配过程出错。硬件故障：摄像头、激光雷达等设备故障，导致无法获得正确的图像或数据。不一致的目标状态：目标可能处于不同的姿态或状态，导致系统无法识别或跟踪。这些问题可以通过优化传感器配置、改进算法、提高环境适应性等方式解决。	The following problems may be encountered when human–robot teams perform assembly target recognition failures: Insufficient sensor accuracy: sensors may fail to provide sufficient detail or accuracy, resulting in target recognition failure. Environmental interference: changes in light, occlusions, or cluttered backgrounds may interfere with the target recognition system. Algorithm error: the target recognition algorithm may fail to recognize the target accurately, or there may be misrecognition or omission of recognition. Target localization error: the target’s localization information may be inaccurate, leading to errors in the subsequent assembly process. Hardware failure: equipment such as cameras, LIDAR, etc. fail, resulting in failure to obtain the correct image or data. Inconsistent target states: targets may be in different attitudes or states, resulting in the system being unable to recognize or track them. These problems can be solved by optimizing sensor configurations, improving algorithms, and improving environmental adaptability.
Specialized LLM (proposed)	问题包括：找不到目标：未能成功识别或接触到预定的目标；找到多目标：识别并定位到多个预定目标；目标定位失败：未能成功地识别、定位到预定的目标位置；二维码无法识别：无法通过扫描二维码获取必要的数据或信息。	Problems include: Failure to find target: failure to successfully identify or reach a predetermined target. Finding multiple targets: identifying and locating more than one intended target. Failure to locate target: failure to successfully identify and locate the intended target position. QR code not recognized: unable to obtain the necessary data by scanning the QR code.
ChatGLM-6B	作为专家系统，分析人机团队执行装配目标识别失效时可能的问题，可以帮助制定有效的应对措施，确保任务顺利完成。可能的问题包括：团队成员沟通不畅，导致任务执行有误。机器设备出现故障，导致任务无法顺利完成。人类员工缺乏必要的技能或知识，导致任务执行失败。任务目标识别不准确，导致任务执行失败。机器设备或人类员工受到外部干扰，导致任务执行失败。通过分析这些问题，可以制定相应的应对措施，以确保人机团队执行装配目标识别任务时能够顺利进行，并确保任务顺利完成。	As an expert system, analyzing the possible problems when human–robot teams perform assembly target identification failures can help develop effective countermeasures to ensure successful task completion. Possible problems include: Miscommunication among team members, leading to errors in task execution. Malfunctioning machinery and equipment, resulting in tasks not being completed successfully. Human employees lacking the necessary skills or knowledge, resulting in task failure. Inaccurate identification of task objectives, resulting in task execution failure. Machine equipment or human employees are subject to external interference, resulting in task execution failure.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, H.; Xue, S.; Zhang, H.; Wang, C.; Fu, Y. A Human–Robot Team Knowledge-Enhanced Large Language Model for Fault Analysis in Lunar Surface Exploration. Aerospace 2025, 12, 325. https://doi.org/10.3390/aerospace12040325

AMA Style

Wang H, Xue S, Zhang H, Wang C, Fu Y. A Human–Robot Team Knowledge-Enhanced Large Language Model for Fault Analysis in Lunar Surface Exploration. Aerospace. 2025; 12(4):325. https://doi.org/10.3390/aerospace12040325

Chicago/Turabian Style

Wang, Hao, Shuqi Xue, Hongbo Zhang, Chunhui Wang, and Yan Fu. 2025. "A Human–Robot Team Knowledge-Enhanced Large Language Model for Fault Analysis in Lunar Surface Exploration" Aerospace 12, no. 4: 325. https://doi.org/10.3390/aerospace12040325

APA Style

Wang, H., Xue, S., Zhang, H., Wang, C., & Fu, Y. (2025). A Human–Robot Team Knowledge-Enhanced Large Language Model for Fault Analysis in Lunar Surface Exploration. Aerospace, 12(4), 325. https://doi.org/10.3390/aerospace12040325

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Human–Robot Team Knowledge-Enhanced Large Language Model for Fault Analysis in Lunar Surface Exploration

Abstract

1. Introduction

2. Related Work

2.1. Human–Robot Collaborative Design and Failure Recovery in Lunar Surface Exploration

2.2. Joint KG and LLM Systems

2.3. Human–Robot Interface Design in Collaboration with Intelligent Agents

3. Methodology of KG-Enhanced LLM for Lunar Surface Exploration

3.1. KG-Based P-Tuning Dataset Construction

3.2. KG Enhances LLM’s P-Tuning Approach

3.3. Multi-Level Prompt Design for Dialogue

4. Case Study

4.1. Experimental Setting

4.2. Experimental Results and Analysis

4.3. Usability and Feasibility Analysis

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI