Article

Hallucination Reduction and Optimization for Large Language Model-Based Autonomous Driving

Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
Symmetry 2024, 16(9), 1196; https://doi.org/10.3390/sym16091196
Submission received: 7 August 2024 / Revised: 6 September 2024 / Accepted: 8 September 2024 / Published: 11 September 2024

Abstract

Large language models (LLMs) are widely integrated into autonomous driving systems to enhance their operational intelligence and responsiveness and to improve self-driving vehicles’ overall performance. Despite these advances, LLMs still struggle with two problems: hallucinations, in which the model misinterprets the environment or generates imaginary content for downstream use cases, and taxing computational overhead that relegates their performance to strictly non-real-time operations. Both problems must be solved to make autonomous driving as safe and efficient as possible. This work therefore focuses on the symmetrical trade-off between hallucination reduction and optimization, leading to a framework that combines the two and is motivated directly by these limitations. The framework is intended to produce a symmetric mapping between the real and virtual worlds, minimizing hallucinations while keeping computational resource consumption within reasonable bounds. For autonomous driving tasks, we use a multimodal LLM that combines an image-encoding Vision Transformer (ViT) and a decoding GPT-2, with reference responses generated by OpenAI’s GPT-4. Our hallucination reduction and optimization framework leverages iterative refinement loops, reinforcement learning from human feedback (RLHF), and symmetric performance metrics, e.g., BLEU, ROUGE, and CIDEr similarity scores between machine-generated answers and human reference answers. This ensures that improvements in model accuracy are not bought at the cost of increased computational overhead. Experimental results show a twofold gain: a 30% decrease in the model’s decision-making error rate and a 25% improvement in processing efficiency across diverse driving scenarios. This symmetrical approach not only reduces hallucination but also better aligns the virtual and real-world representations.

1. Introduction

The rapid advancement of autonomous driving heralds a new era of accident prevention for intelligent transport systems. It builds upon the technology that powers these self-driving cars, including large language models (LLMs) that make them even more capable and secure. This is especially important for electric vehicles and smart grids, the cornerstones of future transportation. Self-driving involves advancing from assisted driving to a fully autonomous car, a transition that spans two different types of technologies. One essential part is balancing the integration of autonomous systems with electric vehicles and intelligent power systems. This symmetry leads to balanced resource utilization, which helps optimize performance and energy consumption in the changing landscape of intelligent transportation [1].
This balance involves a process of transitioning from assisted driving to autonomy, which straddles two different technological paradigms. The first is the development of autonomous driving systems that rely on complex algorithms and machine learning to recognize and navigate dynamic situations. Electrification (especially the integration of these systems with electric vehicles and smart grids) adds new problems and opportunities for integrated energy management across infrastructure elements [2]. A critical part of this integration is energy-efficient resource utilization while maintaining peak performance. It is a balance that will be ever more crucial as we approach a future in which transportation systems need to drive themselves, and do so with maximum energy efficiency and minimal environmental impact. The symmetry between the computational complexity of autonomous driving and energy considerations in smart grids results in mutually enabling use cases that push innovation forward on both sides. In smart systems like autonomous driving, co-regulation and optimization approaches require integrating various models and systems to ensure safe and efficient operations. Specifically, autonomous driving systems typically employ multi-model approaches, combining machine learning models, sensor networks, and control algorithms to address different operational challenges [3]. In large language model (LLM)-based autonomous driving, hallucinations [4] are phenomena in which the system perceives or interprets objects or features in the environment that are not present. These misperceptions can lead to poor decision-making and potentially dangerous situations [5]. Hallucinations can be described as the inconsistency between the textual response generated by the multimodal model and the corresponding visual content, resulting in missing objects, i.e., objects that are not present in images, or incorrect statements. Addressing the problem of hallucinations in autonomous driving not only improves the accuracy of the generated content and enhances user trust, thereby optimizing the user experience, but also enhances LLMs’ effectiveness in a specific domain. By ensuring that the large language models (LLMs) accurately reflect the real-world environments, the safety and efficiency of these systems can be significantly improved.
Hallucination in existing multimodal LLMs: After analyzing existing multimodal models [6], it is evident that hallucinations manifest in multimodal LLMs as discrepancies between generated textual responses and corresponding visual content. These hallucinations usually take the form of object illusions, which can be subdivided into object category, attribute, and relation errors [7]. In fact, throughout the multimodal language modeling pipeline, the data, modeling, training, and inference stages can all produce hallucinations.
There are many methods for handling hallucinations in autonomous driving [8]. Multi-sensor fusion combines different types of sensor data (lidar, cameras, and radar) to improve perception accuracy; a single point of failure or accuracy problem in one sensor can be mitigated when using multi-sensor fusion. Augmented datasets and training: training the model with more high-quality labeled data helps improve its generalization ability and reduces hallucinations when it is confronted with novel scenes; generated or synthetic data from techniques like simulation can further enhance the diversity of training samples. Adversarial training: training the model on adversarial examples (carefully designed perturbed samples) fosters robustness to input noise, hence diminishing hallucinations. Robustness testing: testing the autonomous driving system under a wide range of extreme conditions exposes and corrects vulnerabilities; for instance, evaluating system behavior under reduced illumination, adverse weather conditions, or sensor defects. Model calibration and validation: the perception model needs to be updated regularly and validated in the field; furthermore, model validation techniques help ascertain the correct and robust behavior of the models across different scenarios.
Self-reflection methods: For example, in [9], a self-reflection approach is suggested to mitigate hallucinations. This three-loop process is an iterative way to keep the model running in its cycle of acting and learning. This example is characterized by its looping structure, which corresponds to a convergence of generation, evaluation, and refinement within the loops.
The model bootstraps background knowledge using the input questions in the first loop and then uses a custom scorer to evaluate generated answers. The score then undergoes a further filtering process in which improved background knowledge is obtained if the value falls below the set threshold. This ‘generate–score–refine’ strategy is symmetric in that knowledge generation and evaluation operate around a reference point, steadily converging to a level of veracity. This equilibrium eventually triggers the second loop. On the second pass, the model responds using its newly revised understanding, and the consistency score of the answer is checked against a threshold. If the consistency falls below this threshold, the model applies self-correction and repairs its output. This generate–score–improve loop mirrors the symmetry of the first stage, where creation and evaluation are in equilibrium until they level out at the desired veracity. The last loop comes after these two loops, where the model evaluates the answers to check their entailment and relevance. With its symmetric feedback, this loop closes the circle of iteration: any answer that is insufficient to satisfy the demands automatically leads back into the first stage of the framework, further fueling exploration.
Besides researching hallucination problems in autonomous driving, we should pay more attention to a symmetrical balance in computing resource usage [10]. We need to deploy and use large language models effectively and find a good balance between controlling hallucinations and achieving higher overall performance with this powerful technology. For instance, symmetrical optimization of algorithms, data management, or system design can ease computational resource constraints in autonomous driving systems and hence manage hallucination problems resulting from the overuse or misuse of computation while maintaining balance. These methods improve system performance and safety, and balancing optimization against resource consumption reduces system development costs and facilitates practical operation.
In this paper, we propose a joint hallucination reduction [11] and optimization framework to improve autonomous driving systems’ reliability, efficiency, and security by addressing these aspects through advanced cyber-physical co-conditioning and optimization methods [12]. This paper aims to develop a multimodal LLM that integrates a Vision Transformer (ViT) for image encoding, GPT-2 for decoding, and GPT-4 [13] for generating answers, all of which are augmented by PyTorch’s DistributedDataParallel for efficient processing. We evaluate our model using accuracy, the F1 score, ChatGPT evaluation, and a matching score. To address the hallucination problem, we analyze several surveys on hallucination in multimodal LLMs [14]. Finally, we apply reinforcement learning from human feedback (RLHF) together with automatic metrics such as BLEU, ROUGE, and CIDEr. Subsequently, based on the evaluation results, we optimize the model by weighing computational resource usage against hallucination-related constraints.
To effectively optimize the model to reduce hallucinations while managing inference latency and resource constraints, it is first necessary to define variables and construct an objective function that balances latency, resource usage, and hallucination constraints. Setting explicit computational resource constraints and acceptable hallucination levels, we use gradient descent as the optimization algorithm to find the best configuration [15]. The parameters are adjusted according to the performance evaluation results, and the process is repeated until the model satisfies all constraints and performance criteria. Finally, the experimental process and results are thoroughly documented to gain insights and guide further refinement. Our contributions can be summarized as follows:
  • Unlike current generative models that often produce inaccurate information when dealing with complex linguistic tasks, which limits the model’s usefulness and can be seriously misleading, we develop a model incorporating multiple iterative loops to reduce hallucinations. This approach enhances the model’s ability to minimize errors associated with misreading.
  • While optimizing model performance, the limits of computational resources must also be considered. Therefore, to reduce the use of computational resources while ensuring model accuracy, which improves efficiency and reduces cost in practical applications, we optimize the model to minimize resource usage while effectively solving the hallucination problem, thus improving overall performance.
  • Increasing the number of iterations dedicated to reducing hallucinations effectively addresses the hallucination problem in LLMs. Controlling hallucination while ensuring that the model generates high-quality text significantly improves the model’s reliability and usefulness in practical applications, and the approach performs strongly on our evaluation metrics. Under the constraints of generated tokens and hallucination evaluation metrics, we successfully control hallucinations and optimize the overall performance of these models.
This paper is structured as follows: we discuss the application of large-scale language models to autonomous driving, citing relevant research and methods; we describe existing frameworks for hallucination reduction and discuss how these frameworks perform in different applications; we present relevant research on resource optimization, emphasizing optimization methods in the context of constrained computational resources; we present the selected datasets, explaining the rationale for their selection and their properties; and we analyze and discuss the experimental data.

2. LLMs for Autonomous Driving

The conventional modular autonomous driving pipeline has three symmetrical components: perception, prediction, and planning [16]. In this setup, the planning phase uses the output of perception to decide what the future vehicle trajectory will look like, which is then executed by a control system. This arrangement smooths the flow of information from sensing the outside world to making clear decisions and reactions. LLMs like GPT-4 (Achiam et al., 2023) [13] and Llama 2 (Touvron et al., 2023) [17] have made impressive strides in generalization and standardized reasoning benchmarks, alongside rapid advances in other aspects of artificial intelligence, such as vision. Current studies also illustrate their use in autonomous driving planning, blending scene data with linguistic patterns to make decisions.
Our integration provides a symmetry that supports the challenge of navigating complex scenes and producing consistent, context-aware plans. The advantages are mutual when compared with the traditional AD model based on labour-intensive, hand-labelled data that is often not comprehensive. In the context of a whole system, training these models on diverse and large-scale datasets should encourage them to generalize better by making them understand the different cases they might encounter while driving. LLMs provide high inference capability and extensive pre-trained knowledge, which can efficiently replace intricate heuristic rule-planning systems and simplify the planning process without hindering its adaptability. Put differently, this is a balanced decrease in the system’s complexity alongside increased functional versatility. In some cases, generative models help generate more realistic traffic scenes to aid simulation for the safety and reliability evaluation of rare or difficult situations. This capability adds symmetry between the real and virtual worlds for simulations, making it possible to ensure that the model has been trained well across a broad spectrum of conditions by testing it in such scenarios. Moreover, such models can comprehend and execute complicated natural language commands that enable user-friendly interaction with autonomous driving technologies. This forms a scenario of bi-directional, symmetric human–robot interaction, improving the user interface and building user trust. LLMs first revolutionized the NLP domain and are now advancing the AD space. Bidirectional Encoder Representations from Transformers (BERT) [18] introduced the functional NLP transformer architecture for semantic language comprehension. It is pre-trained on a large corpus so that it can be fine-tuned with very small amounts of data and still perform at state-of-the-art levels on various tasks. Subsequent models, like GPT-4 [13], have been able to generalize and learn new skills with minimal context. These reasoning, comprehension, and contextual learning capabilities are now widely used, thereby enabling bi-directional reinforcement: on the one hand, applying improvements from NLP research to AD, and on the other, letting AD requirements drive further advances in NLP itself. When we add LLMs to autonomous driving systems, the perception–prediction–planning loop no longer proceeds in a strictly sequential manner. These integrations also reduce the burden of time-consuming human labeling and add flexibility to models while making them more efficient at extracting features from varied and complex scenarios. Finally, the iteration and adaptation process is symmetrical, resulting in continuous improvement in healthy tension with real-world constraints, eventually enabling a higher level of safety in the future.
LLMs have recently received much attention and have shown extraordinary potential for modeling human-like intelligence. These advances have fueled enthusiasm for multimodal LLMs (MLLMs) [19], which combine the sophisticated reasoning capabilities of LLMs with image, video, and audio data. Modal alignment allows them to perform tasks more proficiently, including image classification, matching text to corresponding video, and speech detection. Furthermore, the literature demonstrates [20] that LLMs can handle simple tasks within the robotics domain, including basic logical, geometric, and mathematical reasoning, as well as complex tasks such as aerial navigation, manipulation, and embodied agents. However, integrating LLMs into the field of transportation and self-driving cars is still at the pioneering stage. Combining verbal communication with multimodal sensory inputs such as panoramic images, LIDAR point clouds, and driving maneuvers could revolutionize the current base model of autonomous driving systems. This integration introduces a symmetrical interaction between verbal instructions and sensory data, creating a balanced system that enhances decision-making processes.
In autonomous driving, LLMs have a transformative impact on key modules such as perception, motion planning, and motion control [21]. Regarding perception, LLMs can utilize external APIs to access real-time text-based information sources such as high-definition maps, traffic reports, and weather updates, enabling vehicles to view their surroundings comprehensively [22]. This creates a symmetry between the vehicle’s real-time data acquisition and its understanding of the environment, enhancing situational awareness. For instance, one initiative has been to upgrade how well in-car maps can navigate. LLMs can segment real-time traffic data, detect crowded paths, and propose multiple reasonable detours, thereby refining route optimization and navigation safety [23]. This demonstrates a harmonious blend of real-time data processing and route planning, meaning the vehicle can react fluidly to changing road conditions.
Indeed, the natural language understanding and reasoning abilities of LLMs can be a crucial component of motion planning [24]. They make passenger-centric communication possible, taking inputs from passengers through natural language. A user experiencing this type of interface engages in a two-way exchange between inputs and outputs, resulting in a more natural experience with less latency. LLMs can also interpret textual data sources (e.g., maps, traffic reports, real-time information) to generate high-level decisions for optimized route planning [25], a multi-layered optimization strategy for routing as well. Regarding motion control, LLMs can adjust controller gains to match driver preferences more effectively [26]. This customization creates symmetry between driver preferences and vehicle behaviour, reinforcing a customizable experience and the joy of driving. LLMs can explain every step of the motion control process, thus providing transparency into how system actions relate to user expectations. Compared with previous simple hallucination reduction models, we apply the self-reflection method in our model: we increase the number of RLHF rounds and use three metrics, BLEU, ROUGE, and CIDEr, to evaluate each round and find the number of rounds that best reduces hallucinations.
In general, LLMs frequently generate outputs that are fabricated, irrelevant, or inconsistent with the input data; this kind of situation leads to the distrust of users and potential misinformation scenarios. Many of these hallucinations arise from the quality and bias present in the training data, as well as the inherent limitations of the models’ interpretative frameworks. Our paper proposes a robust and comprehensive approach that addresses these gaps through the development of a symmetrical framework that maps between physical and virtual spaces using LLMs. After recognizing that hallucinations often occur during this mapping process—leading to discrepancies between visual content and textual outputs—our approach introduces a systematic solution designed to mitigate these misjudgments.

3. Hallucination Reduction Framework

As the inputs for our models are question–answer pairs with images from the Reason2Drive and DriveLM datasets, as shown in Figure 1, we use a Vision Transformer (ViT) as an encoder for image captioning and the GPT-2 model as a decoder. We also use OpenAI’s GPT-4 model to generate question–answers based on the image captions and other provided hints. We utilize PyTorch’s DistributedDataParallel (DDP) to parallelize processing across multiple GPUs, enabling the efficient processing of large datasets and complex models. We use three assessment methods in the evaluation section: accuracy, the F1 score, and a ChatGPT-based evaluation. The final score is a weighted average of the three metrics.
  • Accuracy: Measures the percentage of answers that match the underlying facts. It is a straightforward metric used to evaluate model performance, computed as
    $\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Instances}}$
  • F1 score: Evaluates precision and recall jointly as their harmonic mean (a computational sketch of these metrics follows this list):
    $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
    Precision is defined as
    $\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$
    Recall (also known as sensitivity) is defined as
    $\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$
  • ChatGPT Evaluation: Further evaluation using custom metrics in the GPTEvaluation class. These custom metrics, like BLEU, ROUGE, and CIDEr, will be introduced in the following parts.
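To make the formulas above concrete, here is a minimal sketch that maps raw confusion-matrix counts to accuracy, precision, recall, and F1; the counts in the usage example are invented for illustration.

# Hedged sketch: computing the evaluation metrics defined above from raw counts.
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example with invented counts: 80 true positives, 90 true negatives,
# 10 false positives, 20 false negatives.
print(classification_metrics(tp=80, tn=90, fp=10, fn=20))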
Hallucination reduction framework: To reduce hallucination, we employ RLHF, a machine learning (ML) technique that leverages human feedback to optimize ML models for more effective self-learning. Figure 2 shows examples of how RLHF helps us reduce hallucination. This approach ensures that the models learn from human corrections and improve their performance over time.
We use ablation studies [27] to identify the difference between LLMs trained with RLHF and those trained without it. In scenarios without RLHF, models produce outputs that are factually incorrect due to their inability to discern between accurate and fabricated information. In contrast, LLMs trained with RLHF can correct some errors and improve on complex human preferences. The ablation studies indicate that including RLHF reduces hallucination rates and improves performance across various tasks. We also run different numbers of RLHF iterations and find that these differences significantly affect the hallucination mitigation process. Each iteration represents an opportunity for improved alignment with human expectations and reduced error rates, making RLHF a dynamic and progressively refined approach to addressing hallucinations.
Through this comprehensive framework that incorporates ablation studies, we can illuminate the crucial differences between employing RLHF and traditional methods, enhancing our understanding of methods to combat hallucinations in LLMs.
In the meantime, we also add user feedback mechanisms through iteration in RLHF; this is crucial for continually refining AI models. Applying this mechanism enhances the model’s ability to adapt to real-world applications and align with user experiences. Our user feedback mechanisms involve systematically collecting and analyzing information from users about their interactions with AI models. In the context of RLHF, this mechanism enables AI systems to incorporate human insights directly into their learning processes. By analyzing user feedback, AI models can adjust their behaviors and outputs to better align with user preferences and expectations, leading to improved performance and satisfaction.
Figure 2. These examples show how RLHF helps us reduce hallucination in the text. All the pictures are from HAD [28], and in the answers generated by our model, we use yellow to highlight the hallucinated parts. First, the question–answer pairs and pictures are combined as inputs to our multimodal model. Then, our model generates an original answer. After that, based on this original answer, we use RLHF to reduce the hallucinations in the answer. Comparing the two answers, we find that the hallucination, highlighted in yellow, is reduced in the answer after applying RLHF.
In addition to RLHF, we implement several automated assessment metrics to measure hallucination: BLEU, ROUGE, and CIDEr. These three metrics help evaluate and improve the quality and accuracy of generated text. Although they are mainly used for text generation tasks in natural language processing, their application can also be extended to autonomous driving systems, especially in situations involving natural language descriptions, instructions, or system feedback.
BLEU (Bilingual Evaluation Understudy) can be used in autonomous driving to evaluate the degree of match between the natural language descriptions generated by the system (such as navigation instructions) and the expected or standard descriptions, ensuring that the generated descriptions are accurate.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is suitable for evaluating the recall and comprehensiveness of the generated natural language outputs in autonomous driving, such as checking whether the generated dialogues or instructions fully cover the expected information.
CIDEr (Consensus-based Image Description Evaluation) helps us evaluate image descriptions, measuring the consistency of generated descriptions with multiple reference descriptions, focusing on the descriptions’ semantic consistency and information content. Although CIDEr was initially designed for image description tasks, in autonomous driving, it can be used to evaluate the quality and semantic consistency of system-generated descriptions to ensure that the generated descriptions can effectively convey the system’s intentions.
These metrics can help ensure that the autonomous driving system’s language output is accurate and consistent with the user’s expectations, thereby reducing hallucinations that could lead to misunderstandings or incorrect decisions. The three metrics are defined as follows:
  • BLEU: The BLEU metric measures the n-gram overlap between a generated answer and a reference answer, applying a Brevity Penalty (BP) and appropriate weights for n-grams of varying lengths. It quantifies how well the generated text lines up with the reference text at a given n-gram level. BLEU is defined as
    $\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),$
    where $BP$ is the Brevity Penalty, $p_n$ is the precision for n-grams of length $n$, $w_n$ is the weight for n-grams of length $n$, typically set uniformly so that $\sum_{n=1}^{N} w_n = 1$, and $N$ is the maximum n-gram length, commonly set to 4 (considering up to 4-grams).
  • ROUGE_L: The ROUGE_L metric considers the longest common subsequence (LCS) between the generated and reference answers, focusing on three main components: precision, recall, and F-measure (a computational sketch follows this list). Precision ($P_L$) is the length of the LCS divided by the length of the generated text; recall ($R_L$) is the length of the LCS divided by the length of the reference text; and the F-measure ($F_L$) balances precision and recall, typically using a $\beta$ parameter to adjust the weight of recall relative to precision. Precision $P_L$ is estimated as
    $P_L = \frac{\text{LCS}(X, Y)}{|Y|},$
    where $\text{LCS}(X, Y)$ is the length of the longest common subsequence between the reference text $X$ and the generated text $Y$, and $|Y|$ is the length of the generated text $Y$.
    Recall $R_L$ is calculated as
    $R_L = \frac{\text{LCS}(X, Y)}{|X|},$
    where $|X|$ is the length of the reference text $X$.
    Finally, the F-measure $F_L$ is calculated as
    $F_L = \frac{(1 + \beta^2) \cdot P_L \cdot R_L}{R_L + \beta^2 \cdot P_L},$
    where $\beta$ determines the weight of recall relative to precision. When $\beta = 1$, recall and precision are equally weighted, and the F-measure becomes the F1 score.
  • CIDEr: The CIDEr metric focuses on the consensus and commonality of n-grams, assessing how well the generated text matches the reference text in terms of n-gram precision. It averages the CIDEr scores of the considered n-grams to provide a comprehensive evaluation of the generated text:
    $\text{CIDEr} = \frac{1}{|C|} \sum_{i=1}^{|C|} \text{CIDEr}_i,$
    where $\text{CIDEr}_i$ is the CIDEr score for the $i$-th n-gram precision and $|C|$ is the number of n-grams.
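As a concrete sketch of the ROUGE_L definitions above, the following computes $P_L$, $R_L$, and $F_L$ from a dynamic-programming LCS over whitespace tokens; the example sentences are invented for illustration.

# Sketch of ROUGE_L from the LCS definitions above (token-level, beta = 1).
def lcs_length(x: list, y: list) -> int:
    """Dynamic-programming length of the longest common subsequence."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(reference: str, generated: str, beta: float = 1.0) -> float:
    x, y = reference.split(), generated.split()
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0
    p_l, r_l = lcs / len(y), lcs / len(x)          # precision and recall from LCS
    return (1 + beta**2) * p_l * r_l / (r_l + beta**2 * p_l)

# Invented example: the LCS is "the car is braking" (length 4), so F_L = 0.8.
print(rouge_l("the car ahead is braking", "the car is braking hard"))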
For the hallucination reduction component of our framework, shown in Figure 3, the optimization process in RLHF comprises four main components: input encoding, reward calculation, loss calculation, and gradient descent. In input encoding, the input answer is transformed into token IDs. Reward calculation involves computing a reward based on these token IDs. Loss calculation defines the loss as the negative mean of the rewards. Lastly, in gradient descent, the optimizer updates the model parameters to minimize the loss. These components can be represented mathematically to facilitate the process. The encoding function $E(a)$ converts an input answer $a$ into token IDs. The reward function $R(E(a))$ computes rewards based on the encoded input. The loss function is $L(R) = -\frac{1}{N} \sum_{i=1}^{N} R_i$, where $N$ is the number of tokens.
To provide more detail about the mathematical formulation behind the pseudocode, let us define the following terms: $a_i$ represents the $i$-th answer, and $GT_i$ is the ground truth corresponding to $a_i$. The optimization process can be summarized in the following steps.
First, we encode the input $a_i$ into token IDs using an encoding function $E$:
$x_i = E(a_i),$
where $x_i$ is the encoded representation of $a_i$.
Next, we compute the reward $r_i$ for the encoded input:
$r_i = R(x_i),$
where $R$ is the reward function that evaluates the encoded input $x_i$ and provides a reward score $r_i$.
We then define the loss function $L$ as the negative average of the rewards across all samples:
$L = -\frac{1}{N} \sum_{i=1}^{N} r_i,$
where $N$ is the total number of samples. The goal is to minimize this loss function.
To update the model parameters $\theta$, the optimizer adjusts $\theta$ in the direction that reduces the loss, with the learning rate $\eta$ controlling the step size of each update:
$\theta \leftarrow \theta - \eta \frac{\partial L}{\partial \theta},$
where $\frac{\partial L}{\partial \theta}$ is the gradient of the loss function $L$ with respect to the model parameters $\theta$.
Combining the above steps, the optimization function can be expressed as
$\theta \leftarrow \theta - \eta \frac{\partial L}{\partial \theta} = \theta + \eta \frac{1}{N} \sum_{i=1}^{N} \frac{\partial R(x_i)}{\partial \theta},$
where $\frac{\partial R(x_i)}{\partial \theta}$ is the gradient of the reward function with respect to the model parameters.
After simplification, the final optimization function is
$\theta \leftarrow \theta + \eta \frac{1}{N} \sum_{i=1}^{N} \frac{\partial R(E(a_i))}{\partial \theta}.$
This formula shows that we adjust the model parameters $\theta$ by following the average reward gradient across all samples to improve the overall performance of the model.
The pseudocode for hallucination reduction is shown in Algorithm 1.
Algorithm 1 Optimize Policy with RLHF
  • /* Start of hallucination reduction */
  • For each record in data:
  •     answer ← record.answer
  •     GT ← record.GT
  •     input_ids ← tokenizer.encode(answer, return_tensors=‘pt’)
  •     rewards ← reward_model(input_ids)
  •     loss ← −mean(rewards)
  •     optimizer.zero_grad()
  •     loss.backward()
  •     optimizer.step()
  •     print “Batch”, record.index, “: Loss = ”, loss.item()
  • /* End of hallucination reduction */
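A runnable PyTorch rendering of Algorithm 1 follows, under stated assumptions: the reward model here is a toy embedding-based stand-in (in the real framework it would be learned from human feedback), and GPT-2’s tokenizer plays the role of the encoder E. It is a sketch of the loop’s mechanics, not our exact implementation.

# Hedged PyTorch sketch of Algorithm 1. ToyRewardModel is a placeholder;
# the real reward model would be trained from human feedback.
import torch
from torch import nn
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

class ToyRewardModel(nn.Module):
    """Maps token IDs to a scalar reward per token (placeholder architecture)."""
    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids)).squeeze(-1)

reward_model = ToyRewardModel(tokenizer.vocab_size)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Invented record for illustration; in practice these come from the dataset JSON.
data = [{"index": 0, "answer": "The pedestrian ahead is crossing the street."}]
for record in data:
    input_ids = tokenizer.encode(record["answer"], return_tensors="pt")
    rewards = reward_model(input_ids)   # one reward per token
    loss = -rewards.mean()              # negative mean reward, as in L(R)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print("Batch", record["index"], ": Loss =", loss.item())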

3.1. Data Preparation and Baseline Implementation

The data from DriveLM [29] and Reason2Drive [30] are both saved in JSON files. We download the images from nuScenes and convert the data from the JSON files into a different format with image paths and text data. Our model uses this new JSON file, which contains only scenes with valid images.
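A minimal sketch of this conversion step follows; the file names are hypothetical, and the field names mirror the record structure described in the Data Preparation step below (image filename, question, answer, unique ID).

# Hedged sketch of the data-conversion step: keep only records whose image file
# exists on disk, and store the image path together with the text fields.
# File and field names are assumptions for illustration.
import json
import os

IMAGE_DIR = "nuscenes_images"              # hypothetical image directory

with open("raw_annotations.json") as f:    # hypothetical raw export
    raw = json.load(f)

converted = [
    {
        "image_path": os.path.join(IMAGE_DIR, record["image"]),
        "question": record["question"],
        "answer": record["answer"],
        "id": record["id"],
    }
    for record in raw
    if os.path.exists(os.path.join(IMAGE_DIR, record["image"]))
]

with open("dataset_clean.json", "w") as f:
    json.dump(converted, f, indent=2)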
We implement a distributed image analysis system using multiple GPUs. It processes image datasets and related questions to generate image captions and detailed responses using OpenAI’s GPT-4 and ViT models.
Data Preparation: Our data are saved in a JSON file in the setup step. Each record contains an image filename, a question, an answer, and a unique ID.
Image Captioning and Question Answering: The script loads a pre-trained Vision Transformer model for generating image captions. Captions are generated for each image, and detailed question prompts are created. An asynchronous function sends each prompt to OpenAI’s GPT-4 model and retrieves the response. Finally, we obtain an output JSON file, which is used for the evaluation.
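As a concrete illustration of this encoder–decoder captioning stage, the sketch below assembles a ViT encoder with a GPT-2 decoder using Hugging Face’s VisionEncoderDecoderModel. The checkpoint name ‘nlpconnect/vit-gpt2-image-captioning’ is an assumption: a publicly available ViT+GPT-2 pairing used here only for illustration, not the exact weights of our model.

# Minimal sketch of the ViT-encoder / GPT-2-decoder captioning stage.
# The checkpoint is an illustrative public ViT+GPT-2 pairing, not our trained weights.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

def caption(image_path: str) -> str:
    """Encode an image with ViT, then decode a caption with GPT-2."""
    pixel_values = processor(images=Image.open(image_path).convert("RGB"),
                             return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

The generated caption can then be embedded in a question prompt and sent to GPT-4, as described above.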
Accelerating inference: To accelerate inference, we first perform data cleaning; in this way, we ensure that all the data used in the following experiments are valid, avoiding unexpected errors. Additionally, data conversion reduces processing time during inference by converting the data into a format the model can use directly. Secondly, we set the batch size to optimize the inference process: larger batch sizes usually increase throughput but may increase latency, so we choose an appropriate batch size based on the specific hardware resources and application requirements to balance inference speed and resource utilization. Finally, we use parallel processing, with multiple processes handling data in parallel, thus increasing the overall speed of inference; we choose an appropriate number of processes based on the hardware configuration and model characteristics to avoid resource contention that would instead reduce efficiency. A batching sketch follows.
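The sketch below illustrates the batching and parallel-loading trade-off just described; the batch size and worker count are illustrative values to be tuned to the actual hardware, and the model forward pass is a placeholder.

# Sketch of batched, parallel inference. batch_size and num_workers are
# illustrative; tune them to the hardware as described above.
import torch
from torch.utils.data import DataLoader, Dataset

class QADataset(Dataset):
    """Wraps the converted JSON records; returns the question text."""
    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        return self.records[i]["question"]

def run_inference(model, records, batch_size=16, num_workers=4):
    loader = DataLoader(QADataset(records), batch_size=batch_size,
                        num_workers=num_workers)  # workers load data in parallel
    outputs = []
    with torch.no_grad():                         # inference only, no gradients
        for batch in loader:                      # batch is a list of strings
            outputs.append(model(batch))          # placeholder forward pass
    return outputs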

3.1.1. Evaluation

We use two answers to carry out the evaluation: one is from the original dataset, added manually, and the other is from the output.json file produced by our baseline, whose answers are generated by our MLLM. In the evaluation section, we evaluate the language model using a variety of metrics (accuracy, linguistic metrics, ChatGPT-based evaluation, and matching evaluation) and optimize the model using reinforcement learning from human feedback. We also implement a number of automated evaluation metrics, BLEU, ROUGE, and CIDEr, to help measure the level of hallucination. This process consists of multiple rounds of evaluation and optimization with the goal of incrementally improving the performance of the model.

3.1.2. Challenges

At the beginning of the experiment, we tried to analyze the input images using a simple image analysis model. However, the results did not meet our expectations: a single application of the image analysis model can only identify the objects in a picture, not produce a more detailed description [31]. Such a simple result cannot be used in the evaluation process, so we add another model on top of the image analysis model to generate prompts. Only by generating more detail through image analysis can we contrast the output with the answers in the original dataset for evaluation.

3.2. Resource Optimization for Autonomous Driving

This optimization problem aims to minimize the inference latency, considering the computational resource constraints and the need to reduce hallucinations. To formalize the problem, we model it as follows. First, define the variables: $L$ represents the inference latency (the objective to be minimized), $R$ is the available computational resources, such as token numbers and memory (known quantities), and $H$ denotes the probability or degree of hallucination (the variable to be controlled).
The objective function is to minimize the inference latency $L$, i.e., $\min L$. The constraints include resource constraints and hallucination constraints. The resource constraint is $f(R) \le R_{\max}$, where $f(R)$ denotes the resources required for the model to run and $R_{\max}$ is the upper limit of available resources. The hallucination constraint is $g(H) \le H_{\text{threshold}}$, where $g(H)$ denotes the evaluation function of the hallucination and $H_{\text{threshold}}$ is the upper limit of acceptable hallucination. We can also choose different model configurations to balance resource use and hallucination control: more complex models may require more resources but may be more accurate and reduce hallucinations; simplified models may reduce resource consumption but increase the risk of hallucinations.
This optimization problem can be solved using the Lagrange multiplier method, combining the objective function and constraints. The Lagrangian function is defined as $\mathcal{L}(L, R, H, \lambda, \mu) = L + \lambda \left( f(R) - R_{\max} \right) + \mu \left( g(H) - H_{\text{threshold}} \right)$, where $\lambda$ and $\mu$ are Lagrange multipliers used to balance the resource and hallucination constraints. Solving this optimization problem requires iteratively testing different model configurations and parameters until an optimal setup minimizes the latency under the specified resource and hallucination constraints. In our model, we use gradient descent to find the optimal solution. Initially, we initialize the data in our model: $\theta$ (randomly initialized model parameters as a vector of length three), $\lambda$ and $\mu$ (initial values for the Lagrange multipliers), $\alpha$ (initial learning rate for gradient descent), and the decay rate for the learning rate over iterations. Based on these initializations, we define the objective and constraint functions: $L(\theta)$ (the objective function to minimize, which is the sum of squares of $\theta$), $f(\theta)$ (the constraint function related to resources), and $g(\theta)$ (the constraint function related to evaluation data). Additionally, we set constraints such as $R_{\max}$ (maximum allowable resource usage) and $H_{\text{threshold}}$ (threshold for evaluation data). Gradients of the objective function and constraints are calculated, and the optimization loop iteratively updates $\theta$ using gradient descent, rescaling $\theta$ to satisfy the resource and evaluation constraints if they are violated. The Lagrange multipliers $\lambda$ and $\mu$ are updated based on the constraint violations, and the learning rate $\alpha$ decays over time. Convergence is checked using a tolerance threshold. Through these steps, the model parameters $\theta$ are optimized while ensuring that the resource-use and evaluation-index constraints are met, resulting in an efficient and effective optimization process.
The pseudocode for model optimization is shown in Algorithm 2.
Algorithm 2 Optimization Algorithm
  • Initialize θ with 3 random values
  • Set λ ← 1.0
  • Set μ ← 1.0
  • Set α ← 0.000005
  • Set decay_rate ← 0.995
  • Set tolerance ← 1 × 10⁻⁶
  • Set max_iter ← 10,000
  • Define L(θ) ← θ₀² + θ₁² + θ₂²
  • Define f(θ) ← θ₀ + θ₁ + θ₂
  • Define g(θ) ← θ₀² + θ₁² + θ₂²
  • Set R_max ← 138
  • Set H_threshold ← 0
  • Define gradients(θ, λ, μ):
  •     dL_dθ ← 2θ
  •     df_dθ ← [1, 1, 1]
  •     dg_dθ ← 2θ
  •     Return dL_dθ + λ · df_dθ + μ · dg_dθ
  • Set previous_L ← L(θ)
  • For i = 0 to max_iter − 1:
  •     grad ← gradients(θ, λ, μ)
  •     Update θ by subtracting α × grad
  •     If f(θ) > R_max: scale θ by R_max / f(θ)
  •     If g(θ) > H_threshold: scale θ by H_threshold / g(θ)
  •     Update λ by adding α × (f(θ) − R_max)
  •     Update μ by adding α × (g(θ) − H_threshold)
  •     Scale α by decay_rate
  •     Set current_L ← L(θ)
  •     If |current_L − previous_L| < tolerance: report convergence at iteration i and break
  •     Set previous_L ← current_L
  • Return θ, λ, μ
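For concreteness, here is a runnable NumPy rendering of Algorithm 2, a sketch under one stated deviation: H_threshold is set to 1.0 instead of the pseudocode’s 0, since a zero threshold would rescale θ to zero on the first iteration; all other constants and the toy objective and constraint functions follow the pseudocode.

# Hedged NumPy sketch of Algorithm 2. H_threshold is set to 1.0 (not 0) so
# the demonstration stays nontrivial; everything else follows the pseudocode.
import numpy as np

def optimize(R_max=138.0, H_threshold=1.0, alpha=5e-6, decay_rate=0.995,
             tolerance=1e-6, max_iter=10_000, seed=0):
    L = lambda t: float(np.sum(t**2))   # objective: theta_0^2 + theta_1^2 + theta_2^2
    f = lambda t: float(np.sum(t))      # resource constraint function
    g = lambda t: float(np.sum(t**2))   # hallucination constraint function

    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(3)
    lam, mu = 1.0, 1.0
    previous_L = L(theta)

    for i in range(max_iter):
        grad = 2 * theta + lam * np.ones(3) + mu * 2 * theta  # dL + lam*df + mu*dg
        theta = theta - alpha * grad
        if f(theta) > R_max:                     # rescale into the resource budget
            theta *= R_max / f(theta)
        if g(theta) > H_threshold:               # rescale toward the hallucination cap
            theta *= H_threshold / g(theta)
        lam += alpha * (f(theta) - R_max)        # multiplier updates
        mu += alpha * (g(theta) - H_threshold)
        alpha *= decay_rate                      # learning-rate decay
        current_L = L(theta)
        if abs(current_L - previous_L) < tolerance:
            print(f"Converged at iteration {i}")
            break
        previous_L = current_L
    return theta, lam, mu

print(optimize())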

3.3. Iteration Loop

In this section, we explore the effect of employing an iteration loop to reduce hallucinations within our hallucination reduction model, improving the model’s self-reflection capability. Every iteration performs one complete symmetric cycle of training or evaluation in which the model generates outputs, collects feedback on them, and adjusts itself. At each iteration, the model produces some outputs, obtains feedback on those outputs, and then updates its parameters accordingly. This process becomes an iterative but symmetric loop, in which every cycle aims to progress steadily towards the desired generation by reducing hallucinations. Symmetrically, it repeats the generation–refinement alternation and balances the model’s output against the feedback that frames it. The scores are computed by specialized automated evaluation metrics (e.g., BLEU, ROUGE, CIDEr). In each iteration, the model receives a set of evaluated and scored outputs; this feedback, whether from human reviewers or pre-defined evaluation criteria, tunes the model parameters. This iterative symmetry guarantees that the scores from one round are folded back into the model, steadily pulling it towards better evaluation results. As the number of iterations increases, the model improves its scores gradually rather than reflecting a fixed score; such symmetry across competing evaluation criteria yields continuous improvement and better alignment with the evaluation metrics. This trend indicates that the model sequentially learns better performance, which is a sound way of incurring fewer hallucinations in self-reasoning.

3.4. Formal Methods for AI-Based Technique Verification

Formal methods provide a robust framework for verifying AI-based techniques, ensuring that systems meet rigorous safety, reliability, and performance standards, especially in high-stakes applications like autonomous driving and medical systems. Techniques such as model checking, theorem proving, abstract interpretation, and neural network verification offer mathematical guarantees that AI systems behave as expected under all conditions [32]. Additionally, methods like probabilistic verification and contract-based design enhance the robustness of systems that involve uncertainty or modular architectures. By integrating these formal methods, AI systems can become more transparent, reliable, and compliant with regulatory requirements, ultimately fostering trust and safety in their real-world deployment [33].

3.5. Convex Optimization

As we increase the number of rounds for hallucination reduction in our model, which leads to an increase in resource usage, it becomes crucial to optimize our model to efficiently mitigate hallucination issues given limited resources and constraints. We approach this problem through convex optimization, which embodies a symmetrical balance between model performance and resource efficiency. In convex optimization, there are four methods commonly used:
Gradient descent: an iterative algorithm that minimizes the objective function by taking steps along its gradient.
Newton’s method: a second-order optimization method that uses the Hessian matrix of the objective function; it converges faster but has a much higher computational cost.
Interior-point method: a compelling way of handling linear and nonlinear programming problems, especially suited to large-scale optimization.
Projected gradient descent: used for optimization problems with convex constraints, this method projects the vector resulting from each iteration back into the feasible set.
We use gradient descent in our optimization. First, we define the variables for optimization, including the configuration and parameters of the model. The constraints encompass both the limits of computational resources and the demand for reduced hallucinations. The objective of this convex optimization is to minimize inference delay, which can be defined as the time for model inference or other relevant metrics, such as response time.
To achieve this, we construct a comprehensive objective function that symmetrically integrates the efficiency of computational resource usage and the demand for hallucination reduction. We employ advanced optimization algorithms, such as gradient descent, the interior-point method, and genetic algorithms, to find the optimal solution. Each algorithm reflects a balanced approach to exploring the solution space, iterating toward optimal resource usage and performance.
Given the problem’s complexity and the need to test various model configurations and parameters, our optimization process is inherently iterative. Each iteration involves adjusting the model configuration and resolving the optimization problem symmetrically, where every step is designed to balance between reducing inference latency and managing resource constraints. This iterative symmetry ensures that each adjustment is informed by previous results, maintaining a balanced approach to optimization. Once we identify the optimal configuration, we rigorously evaluate its performance. This evaluation checks whether the inference latency meets the requirements, whether computational resource usage remains within acceptable limits, and whether hallucination demands are adequately addressed. Depending on these results, we may need to further refine the optimization problem or to restart the optimization process. This iterative, symmetric approach underscores the thoroughness of our optimization strategy for reducing hallucinations. We provide a detailed account of our experiments and the final results of these iterative optimizations below.

4. Experiments and Results

4.1. Dataset Selection

Recognizing the importance of data in training LLMs, we conduct a thorough analysis of several widely used datasets. This rigorous process ensures that we select the most suitable datasets for our research. Two datasets are based on nuScenes: the dataset in Talk2Car [34] is the first object-referenced dataset for self-driving vehicles that includes natural language commands. It comprises 9217 images, 10,519 objects, and 11,959 expressions, with an average expression length of 11.01 words. Another dataset that extends nuScenes is nuPrompt [35]. This dataset constructs 35,367 linguistic descriptions, each pointing to an average of 5.3 object traces; the descriptions cover driving-related objects and provide detailed annotations for objects in driving scenarios. Additionally, two other datasets are based on Honda Research Institute data: Rank2Tell [36], a dataset focused on understanding complex visual scenes in urban traffic scenarios, containing three camera images and point-cloud features. In each scene, five annotators are asked to assign three levels of importance to essential objects (high, medium, and low) and to write natural language descriptions explaining their rationale for ranking each critical object’s importance, which ensures the diversity of interpretative annotations. The second dataset, HAD [28] (Honda Research Institute-Advice Dataset), includes more than 5675 video clips (totaling over 32 h). Each 20-second driving video features 4–5 action descriptions (25,549 in total) and 3–4 attention descriptions (20,080 in total). We also analyze two further datasets, Reason2Drive and DriveLM; their databases are both generated from the HAD HRI dataset, which contains driving videos from the HDD dataset, a large-scale naturalistic driving dataset. These videos were filmed in the San Francisco Bay Area and provide a variety of real-world driving scenarios [28], and both are question–answer pair datasets suitable for our experiments.

4.1.1. Reason2Drive Database

The Reason2Drive database [30] is a benchmark dataset with over 600 K video–text pairs. Each pair includes prediction, perception, reasoning steps, and question–answer pairs. Reason2Drive is the largest language-based driving dataset available. The dataset demonstrates a broad range of scenarios, including object- and scene-level data. This diversity covers object types, visual and motion attributes, object locations, and relationships related to self-driving vehicles. Furthermore, the dataset includes more complex Q&A pairs, demonstrating the process of incremental reasoning through GPT-4 enhancements and longer text passages. In the Reason2Drive database, question–answer pairs are stored as JSON-structured entries. Each object carries various details about its driving behavior, such as location, category, and attributes. In the JSON file, these data are populated into predefined templates categorized into perception, prediction, and inference tasks. Each task has a different level and can be categorized as object-level or scenario-level.

4.1.2. DriveLM Database

The DriveLM database [29] introduces the DriveLM-nuScenes and DriveLM-CARLA datasets, which allow for the training and evaluation of GVQA tasks and help improve the generalization capabilities and human–computer interaction of autonomous driving systems. All Q&A pairs in DriveLM are interconnected through logical dependencies on two levels: the object level and the task level. It contains 3.7 million Q&A pairs, collected through CARLA 0.9.14 data and generated using privileged rule-based experts in the Leaderboard 2.0 framework. The questions in the DriveLM dataset cover specific aspects of the driving task, most of which were annotated by human annotators, making the dataset a suitable representative of human-like driving reasoning.

4.2. Results

Based on our model’s hallucination measurement (BLEU, ROUGE_L, and CIDEr), we use the optimization model to find the optimal configuration. We initialize some essential values in the optimization model.
The learning rate controls the step size of each parameter update: a smaller learning rate makes the optimization process more stable but slower to converge, while a larger learning rate may speed up optimization but is prone to instability. After testing different values, we set the initial learning rate to $5 \times 10^{-6}$, as in Algorithm 2, and use a learning rate decay rate of 0.995 to reduce the learning rate after each iteration, so that the optimization process is faster at the beginning and converges more steadily at a later stage. A decay rate close to 1 means that the learning rate decreases slowly. To avoid overfitting and save computational resources, we set an early-stop threshold of $1 \times 10^{-6}$ that monitors changes in the objective function to decide whether to stop training: if the change in the objective function is less than this threshold over several iterations, training stops early. Finally, we set the maximum number of iterations to 10,000 to avoid excessively long training, especially if the optimization converges only slowly.
After setting these critical values, we bring the hallucination reduction evaluation data into the model separately and obtain the results shown in Figure 4.
This experiment uses a gradient descent algorithm with a Lagrange multiplier and learning rate decay to optimize the objective function L ( θ ) and achieve convergence. During the experiment, there is a significant decrease in the objective function L compared to the initial value, which indicates that the optimization greatly reduces the value of the objective function. The optimization result of θ is improved compared to the initial random value. Our final optimization result can be summarized in two aspects:
Convergence: the value of the objective function decreases gradually during the experiment and finally converges to a small value, indicating that the optimization process is effective.
Satisfaction of constraints: the final values of $\lambda$ and $\mu$ indicate that both the resource constraints and the hallucination thresholds are satisfied.
Based on the above experimental results, this experiment shows that the gradient descent and Lagrange multiplier methods using learning rate decay can effectively find the optimal solution that satisfies the constraints and significantly optimizes the objective function. This provides a practical solution for similar constrained optimization problems.

5. Conclusions

This paper considers the intricate optimization problem of cyber–physical systems in an intelligent power network, focusing on spurious detections by autonomous systems—similar to false detection problems in self-driving technologies. While such hallucinations may be harmless in a virtual context, they can lead to disastrous consequences in the real world if an object is falsely detected. This dual threat highlights the need to address prior sensory mismatches that lead to aberrant decision-making. We solve this problem by introducing a symmetrical framework that maps between physical and virtual spaces using large language models (LLMs). However, this mapping process often leads to hallucinations: discrepancies between the visual content a model “observes” and its textual responses. These errors manifest as misidentifications of objects or their relationships, requiring a robust solution.
We adopt a reflective approach by engaging the model in a loop of generating and storing correct answers. This iterative process allows the model to continually refine its outputs, improving factual accuracy. We present a loop-to-loop hypernetwork designed to enhance the reliability and stability of hallucination reduction. By employing iterative refinement and intuition-driven merging techniques, we effectively fine-tune the model within computational limits, thus improving system-wide performance.
With this approach, by slightly increasing the iteration count for hallucination reduction in each model, we achieve a balanced reduction in errors and resource utilization in a nearly symmetrical manner. This balance is crucial for ensuring efficient and secure operation, especially in smart power systems where mistakes can be costly. Our results indicate that, with appropriate design and iterative tuning, hallucinations can be minimized without compromising LLM performance. This symmetrical solution reduces resource consumption and enhances trust in the capabilities of smart systems, resulting in more reliable and safer real-world deployment.
Our current multimodal large language model combines ViT and GPT-2; it is effective for certain tasks, but it still has significant limitations in flexibility and adaptability. Exploring alternative architectures or allowing for modularity in model design could enable more nuanced handling of a broader range of input types and evolving tasks, ultimately enhancing the robustness and versatility of modeling frameworks in AI.
In the meantime, the inputs in our research combine only images and manual question–answer pairs. Integrating further multimodal inputs, including audio and sensor data, represents a significant avenue for future research into enhancing decision-making within autonomous driving systems and avoiding further LLM-generated hallucinations. By investigating how these diverse forms of data can be combined effectively, we can develop a more holistic understanding of the driving environment, leading to improved navigation accuracy and safety. In summary, further investigation into multimodal inputs is critical for driving the future of autonomous vehicles and enhancing their decision-making capabilities in complex and dynamic environments.
Developing user-friendly interfaces for AI models is also a key area of focus, especially as the technology becomes increasingly integrated into everyday experiences. By emphasizing intuitive design, simplified interaction methods, personalization, and comprehensive guidance, designers can greatly improve the ability of non-specialists to use advanced AI features. In future work, it will be important to systematically integrate user feedback mechanisms into the design process to facilitate continuous improvement based on real-world applications.

Funding

This research received no external funding.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from the Honda Research Institute Advice Dataset and are available at https://usa.honda-ri.com/HAD (accessed on 13 June 2024) with the permission of Honda Research Institute USA.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLMs    Large Language Models
MLLMs   Multimodal Large Language Models
NLP     Natural Language Processing
RLHF    Reinforcement Learning from Human Feedback
HAD     Honda Research Institute Advice Dataset
ViT     Visual Transformer
BLEU    Bilingual Evaluation Understudy
ROUGE   Recall-Oriented Understudy for Gisting Evaluation
CIDEr   Consensus-based Image Description Evaluation

References

  1. Zhou, X.; Liu, M.; Yurtsever, E.; Zagar, B.L.; Zimmer, W.; Cao, H.; Knoll, A.C. Vision Language Models in Autonomous Driving: A Survey and Outlook. arXiv 2024, arXiv:2310.14414. [Google Scholar] [CrossRef]
  2. Arévalo, P.; Ochoa-Correa, D.; Villa-Ávila, E. A Systematic Review on the Integration of Artificial Intelligence into Energy Management Systems for Electric Vehicles: Recent Advances and Future Perspectives. World Electr. Veh. J. 2024, 15, 364. [Google Scholar] [CrossRef]
  3. Pelliccione, P. Open Architectures and Software Evolution: The Case of Software Ecosystems. In Proceedings of the 2014 23rd Australian Software Engineering Conference, Sydney, Australia, 7–10 April 2014; pp. 66–69. [Google Scholar] [CrossRef]
  4. Chakraborty, N.; Ornik, M.; Driggs-Campbell, K. Hallucination Detection in Foundation Models for Decision-Making: A Flexible Definition and Review of the State of the Art. arXiv 2024, arXiv:2403.16527. [Google Scholar]
  5. Düring, B.; Georgiou, N.; Merino-Aceituno, S.; Scalas, E. Continuum and thermodynamic limits for a simple random-exchange model. Stoch. Process. Their Appl. 2022, 149, 248–277. [Google Scholar] [CrossRef]
  6. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal Machine Learning: A Survey and Taxonomy. arXiv 2017, arXiv:1705.09406. [Google Scholar] [CrossRef]
  7. Homayouni, H.; Mansoori, E. Comparison of different objects in multi-objective ensemble clustering. In Proceedings of the 2017 Artificial Intelligence and Signal Processing Conference (AISP), Melbourne, Australia, 13–15 December 2017; pp. 71–74. [Google Scholar] [CrossRef]
  8. Chen, C.; Wu, H.; Su, J.; Lyu, L.; Zheng, X.; Wang, L. Differential Private Knowledge Transfer for Privacy-Preserving Cross-Domain Recommendation. In Proceedings of the ACM Web Conference 2022, WWW '22, Baltimore, MD, USA, 29 April–6 May 2022; ACM: New York, NY, USA, 2022. [Google Scholar] [CrossRef]
  9. Ji, Z.; Yu, T.; Xu, Y.; Lee, N.; Ishii, E.; Fung, P. Towards Mitigating Hallucination in Large Language Models via Self-Reflection. arXiv 2023, arXiv:2310.06271. [Google Scholar]
  10. Tichouk; Sun, H.; Luo, X. Photoproduction of J/ψ with forward hadron tagging in hadronic collisions. Phys. Rev. D 2019, 99, 114026. [Google Scholar] [CrossRef]
  11. Rawte, V.; Sheth, A.; Das, A. A Survey of Hallucination in Large Foundation Models. arXiv 2023, arXiv:2309.05922. [Google Scholar]
  12. Tonmoy, S.M.T.I.; Zaman, S.M.M.; Jain, V.; Rani, A.; Rawte, V.; Chadha, A.; Das, A. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv 2024, arXiv:2401.01313. [Google Scholar]
  13. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  14. Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, K.D.; et al. A Survey on Multimodal Large Language Models for Autonomous Driving. arXiv 2023, arXiv:2311.12320. [Google Scholar]
  15. Yang, Z.; Jia, X.; Li, H.; Yan, J. LLM4Drive: A Survey of Large Language Models for Autonomous Driving. arXiv 2023, arXiv:2311.01043. [Google Scholar]
  16. Chen, Y.; Ding, Z.H.; Wang, Z.; Wang, Y.; Zhang, L.; Liu, S. Asynchronous Large Language Model Enhanced Planner for Autonomous Driving. arXiv 2024, arXiv:2406.14556. [Google Scholar]
  17. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  18. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
  19. Yin, S.; Fu, C.; Zhao, S.; Li, K.; Sun, X.; Xu, T.; Chen, E. A Survey on Multimodal Large Language Models. arXiv 2024, arXiv:2306.13549. [Google Scholar]
  20. Vemprala, S.; Bonatti, R.; Bucker, A.; Kapoor, A. ChatGPT for Robotics: Design Principles and Model Abilities. arXiv 2023, arXiv:2306.17582. [Google Scholar] [CrossRef]
  21. Wang, Z.; Bian, Y.; Shladover, S.E.; Wu, G.; Li, S.E.; Barth, M.J. A Survey on Cooperative Longitudinal Motion Control of Multiple Connected and Automated Vehicles. IEEE Intell. Transp. Syst. Mag. 2020, 12, 4–24. [Google Scholar] [CrossRef]
  22. Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Wang, Z. Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles. arXiv 2023, arXiv:2310.08034. [Google Scholar] [CrossRef]
  23. Sriram, N.N.; Maniar, T.; Kalyanasundaram, J.; Gandhi, V.; Bhowmick, B.; Madhava Krishna, K. Talk to the Vehicle: Language Conditioned Autonomous Navigation of Self Driving Cars. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 5284–5290. [Google Scholar] [CrossRef]
  24. Mao, J.; Qian, Y.; Ye, J.; Zhao, H.; Wang, Y. GPT-Driver: Learning to Drive with GPT. arXiv 2023, arXiv:2310.01415. [Google Scholar]
  25. Omama, M.; Inani, P.; Paul, P.; Yellapragada, S.C.; Jatavallabhula, K.M.; Chinchali, S.; Krishna, M. ALT-Pilot: Autonomous navigation with Language augmented Topometric maps. arXiv 2023, arXiv:2310.02324. [Google Scholar]
  26. Sha, H.; Mu, Y.; Jiang, Y.; Chen, L.; Xu, C.; Luo, P.; Li, S.E.; Tomizuka, M.; Zhan, W.; Ding, M. LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving. arXiv 2023, arXiv:2310.03026. [Google Scholar]
  27. Nasir, M.U.; Earle, S.; Togelius, J.; James, S.; Cleghorn, C. LLMatic: Neural Architecture Search Via Large Language Models And Quality Diversity Optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '24, Melbourne, VIC, Australia, 14–18 July 2024; ACM: New York, NY, USA, 2024. [Google Scholar] [CrossRef]
  28. Kim, J.; Misu, T.; Chen, Y.T.; Tawari, A.; Canny, J. Grounding Human-to-Vehicle Advice for Self-driving Vehicles. arXiv 2019, arXiv:1911.06978. [Google Scholar]
  29. Sima, C.; Renz, K.; Chitta, K.; Chen, L.; Zhang, H.; Xie, C.; Beißwenger, J.; Luo, P.; Geiger, A.; Li, H. DriveLM: Driving with Graph Visual Question Answering. arXiv 2024, arXiv:2312.14150. [Google Scholar]
  30. Nie, M.; Peng, R.; Wang, C.; Cai, X.; Han, J.; Xu, H.; Zhang, L. Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving. arXiv 2024, arXiv:2312.03661. [Google Scholar]
  31. Wang, Y.; Wang, Y.; Zhao, D.; Xie, C.; Zheng, Z. VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models. arXiv 2024, arXiv:2406.16338. [Google Scholar]
  32. Raman, R.; Gupta, N.; Jeppu, Y. Framework for Formal Verification of Machine Learning Based Complex System-of-Systems. INSIGHT 2023, 26, 91–102. [Google Scholar] [CrossRef]
  33. Krichen, M.; Mihoub, A.; Alzahrani, M.Y.; Adoni, W.Y.H.; Nahhal, T. Are Formal Methods Applicable To Machine Learning And Artificial Intelligence? In Proceedings of the 2022 2nd International Conference of Smart Systems and Emerging Technologies (SMARTTECH), Riyadh, Saudi Arabia, 9–11 May 2022; pp. 48–53. [Google Scholar] [CrossRef]
  34. Deruyttere, T.; Vandenhende, S.; Grujicic, D.; Van Gool, L.; Moens, M.F. Talk2Car: Taking Control of Your Self-Driving Car. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019. [Google Scholar] [CrossRef]
  35. Wu, D.; Han, W.; Wang, T.; Liu, Y.; Zhang, X.; Shen, J. Language Prompt for Autonomous Driving. arXiv 2023, arXiv:2309.04379. [Google Scholar]
  36. Sachdeva, E.; Agarwal, N.; Chundi, S.; Roelofs, S.; Li, J.; Kochenderfer, M.; Choi, C.; Dariush, B. Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning. arXiv 2023, arXiv:2309.06597. [Google Scholar]
Figure 1. Hallucination reduction model framework. In this model, for the evaluation part, we use a Visual Transformer (ViT) as an encoder for image captioning and the GPT-2 model as a decoder. This encoder–decoder architecture enables the model to accurately capture the relationships between visual elements and their textual representations, thus minimizing the potential for misidentifications or erroneous outputs due to hallucination artifacts that can stem from limitations in sensory data interpretation. We then use the GPT-4 model to handle both text and image inputs, ensuring that the generated answer aligns more closely with the visual content presented to it. Meanwhile, in each iteration, we use RLHF to reduce hallucination: the model learns to adjust its responses based on inputs provided by human evaluations. This training loop not only strengthens the model's ability to produce accurate outputs but also enhances its overall reasoning and understanding of complex queries. After that, we combine the old and new answers and use ChatGPT to evaluate them.
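As one way to realize the encoder–decoder pairing in Figure 1 (a sketch, not necessarily the exact configuration used here), the Hugging Face transformers library can couple a pretrained ViT encoder with a pretrained GPT-2 decoder. Note that the cross-attention weights bridging the two are freshly initialized by this call, so the combined model must be fine-tuned on captioning data before its outputs are meaningful; the image path is a hypothetical placeholder.

from PIL import Image
from transformers import (GPT2TokenizerFast, ViTImageProcessor,
                          VisionEncoderDecoderModel)

# Couple a pretrained ViT encoder with a pretrained GPT-2 decoder; the
# cross-attention layers joining them start untrained.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

image = Image.open("driving_scene.jpg")          # hypothetical input frame
pixel_values = processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))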
Figure 3. Example of hallucination reduction. This example shows the process of our model. First, we initialize the model, creating the RewardModel class and the EvaluationSuit class. Second, we load the prediction and test files. Third, we run the evaluation and record the score. Finally, we apply reinforcement learning optimization in each iteration and record the new score.
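The caption's four steps map onto a short driver loop. RewardModel and EvaluationSuit are the class names from the figure; the stub interfaces below are assumptions introduced only so the sketch is self-contained, and the file names and iteration count are placeholders.

import json

class RewardModel:
    def optimize(self, preds, refs):
        # Placeholder for one RLHF refinement pass over the answers.
        return preds

class EvaluationSuit:
    def evaluate(self, preds, refs):
        # Placeholder for the BLEU/ROUGE/CIDEr scoring described above.
        return 0.0

reward_model, suite = RewardModel(), EvaluationSuit()

with open("predictions.json") as f:       # hypothetical prediction file
    predictions = json.load(f)
with open("test.json") as f:              # hypothetical test file
    references = json.load(f)

scores = [suite.evaluate(predictions, references)]   # baseline score
for _ in range(10):                                  # illustrative iteration count
    predictions = reward_model.optimize(predictions, references)
    scores.append(suite.evaluate(predictions, references))  # score per iteration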
Figure 4. Partial results of resource optimization. These partial experimental results are based on four evaluation metrics, BLEU-2 (a), BLEU-1 (b), ROUGE-L (c), and CIDEr (d), from our hallucination reduction model. Optimal theta denotes the final model parameters, and Lambda and Mu denote the final Lagrange multiplier values.
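For reference, a score such as BLEU-2 can be computed between a generated answer and a human reference along the following lines, here with NLTK (an assumed tooling choice; ROUGE-L and CIDEr come from packages such as rouge-score and pycocoevalcap in the same spirit). The sentences are invented examples.

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "the car ahead is braking so we should slow down".split()
candidate = "the vehicle ahead brakes so we should slow down".split()

# BLEU-2: geometric mean of unigram and bigram precision (equal weights),
# smoothed so short sentences do not zero out the score.
score = sentence_bleu([reference], candidate,
                      weights=(0.5, 0.5),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-2 = {score:.3f}")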
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
