We introduce a novel method to generate complex 3D terrain models. This method overcomes the limitations of existing diffusion-model-based approaches, which struggle with large and complex terrains. This difficulty arises because diffusion models have limited spatial awareness, such as understanding directional information (north, south, east, west) within images, and cannot convert text describing terrain orientation into data. Additionally, this limitation makes it challenging to make localized changes to already-generated models using only text. Our method leverages the spatial awareness capabilities of large language models, bypassing diffusion models, and is divided into three steps: First, we design a Gaussian–Voronoi map data structure to generate complex terrain heightmaps from simple data inputs. Second, we construct a chain-of-thought behavior tree strategy to extract terrain data from text inputs. Lastly, we introduce a multiagent-based method for terrain feature adjustment to support detailed editing and updating of generated terrains through text.
3.1. Gaussian–Voronoi Map
The primary challenge in generating terrain data using large language models is the lack of stability when outputting large amounts of text, making it difficult to directly output complex terrain data. To tackle this challenge, we introduce the Gaussian–Voronoi map, an innovative data structure. This method combines the spatial partitioning capability of Voronoi diagrams with the smoothing characteristics of Gaussian functions to simulate diverse terrain features.
Figure 2 explains the working principle of the Gaussian–Voronoi map.
Specifically, we first use a Gaussian function to roughly express the height of a certain area:
where
h is the elevation height of the Gaussian function,
w is the influence weight of the Gaussian function on the map, and
d is the distance from any point on the map to the center of the Gaussian function. Diverging from traditional Gaussian applications, we do not use a covariance matrix to control scaling and rotation; instead, we achieve a similar effect by adding more Gaussian functions to the map. Because the parameters of the Gaussian functions are generated by a large language model, omitting rotation and scaling reduces the number of dimensions the model must handle when analyzing terrain features, which in turn reduces the potential error rate. This strategy aims to leverage the capabilities of large language models while alleviating their burden in handling complex terrain generation tasks.
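As an illustration of the parameterization described above, the following minimal sketch evaluates an isotropic Gaussian from its four parameters and approximates an elongated ridge by placing several such Gaussians in a row rather than rotating or scaling a single one. The class and function names are hypothetical, and the expression h · exp(−d² / (2w²)) is an assumed standard form in which w acts as the spread; the exact formula used in this work may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class TerrainGaussian:
    x: float  # centre column, in pixels
    y: float  # centre row, in pixels
    h: float  # elevation height at the centre
    w: float  # influence weight; treated here as the spread (assumption)


def gaussian_height(g: TerrainGaussian, px: float, py: float) -> float:
    """Height contributed by one isotropic Gaussian at point (px, py)."""
    d = math.hypot(px - g.x, py - g.y)  # distance to the Gaussian centre
    # Assumed form: h * exp(-d^2 / (2 w^2)); not taken from the paper.
    return g.h * math.exp(-(d * d) / (2.0 * g.w * g.w))


# An east-west ridge approximated by three isotropic Gaussians in a row,
# instead of one Gaussian stretched by a covariance matrix.
RIDGE = [TerrainGaussian(x, 32.0, 100.0, 6.0) for x in (20.0, 32.0, 44.0)]


def ridge_height(px: float, py: float) -> float:
    """Sum of all ridge Gaussians at point (px, py)."""
    return sum(gaussian_height(g, px, py) for g in RIDGE)
```

Sampling `ridge_height` 12 pixels east of the middle Gaussian gives a much larger value than sampling 12 pixels north of it, which is the anisotropy that a covariance matrix would otherwise provide.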
After generating the Gaussian data, we integrate them into a Voronoi diagram. This step divides the space into regions, each influenced by Gaussian peaks, creating a more varied terrain. We use the Voronoi diagram to subdivide the plane into several approximately equal-sized areas. For any pixel on the plane, we find the nearest Voronoi point, which determines the Voronoi area to which the pixel belongs; next, for the pixels of each Voronoi area, we shape the terrain details by overlapping Gaussians. To capture the terrain’s randomness and diversity, we go beyond letting each Gaussian impact the terrain uniformly. We employ a probability function to decide whether a given Gaussian affects a specific point, enhancing the model’s realism and flexibility. The probability function is determined by two Gaussian distribution functions:
where the symbols have the same meanings as in the previous formula, and n is the L2 norm of the map size. For all Gaussians determined to be superimposed on a Voronoi block, we simply sum their values to obtain the value of that block:
Finally, rendering each block according to its data into a grayscale image yields the 3D terrain heightmap. By exchanging data with the large language model in the form of Gaussians and decoding them through the Gaussian–Voronoi map structure, we can convert simple summaries of terrain features into complex terrains; by adjusting the number of Voronoi blocks, we can obtain 3D terrain heightmaps of a specified size and complexity. We chose the Gaussian as the container for simple terrain features mainly because it summarizes the general terrain height with just four parameters, as mentioned above: the two-dimensional coordinates, the elevation height, and the influence weight. This allows the large language model to describe and adjust 3D terrains with high accuracy. In addition, the Gaussian construction is simple, clear, and widely used, which reduces the difficulty of explaining the required data and its meaning to the large language model. These advantages enable us to achieve high-quality data output even in a zero-shot setting, as detailed in the experimental section. We chose the Voronoi diagram for converting simple data into complex data because it divides the entire image into regions of almost equal weight and introduces rich randomness, which helps simulate 3D terrain scenes; this has been verified in earlier game-development methods.
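The decoding pipeline described in this subsection can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation used in this work: Voronoi points are drawn uniformly at random (which does not guarantee equal-sized areas), each Gaussian is evaluated at the Voronoi point of a block, and the inclusion probability is a single hypothetical Gaussian falloff over the distance normalized by n, standing in for the two-Gaussian probability function of the text. The function name and the `spread` parameter are likewise hypothetical.

```python
import math
import random


def gaussian_voronoi_map(gaussians, size, n_sites, spread=0.25, seed=0):
    """Decode (x, y, h, w) Gaussians into a size x size grid of 0..255 gray levels."""
    rng = random.Random(seed)
    # Random Voronoi points; each one owns the pixels closest to it.
    sites = [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n_sites)]
    n = math.hypot(size, size)  # L2 norm of the map size

    # One value per Voronoi block: the sum of the Gaussians selected for it.
    block_values = []
    for sx, sy in sites:
        total = 0.0
        for gx, gy, h, w in gaussians:
            d = math.hypot(sx - gx, sy - gy)
            # Stand-in inclusion probability (assumption, not the paper's formula).
            p = math.exp(-((d / n) ** 2) / (2.0 * spread ** 2))
            if rng.random() < p:
                # Assumed Gaussian form h * exp(-d^2 / (2 w^2)).
                total += h * math.exp(-(d * d) / (2.0 * w * w))
        block_values.append(total)

    def nearest_site(px, py):
        """Index of the Voronoi point closest to pixel (px, py)."""
        return min(
            range(n_sites),
            key=lambda i: (sites[i][0] - px) ** 2 + (sites[i][1] - py) ** 2,
        )

    # Every pixel takes the value of the block it belongs to.
    raw = [
        [block_values[nearest_site(px, py)] for px in range(size)]
        for py in range(size)
    ]

    # Min-max normalization to 8-bit grayscale.
    lo = min(min(row) for row in raw)
    hi = max(max(row) for row in raw)
    span = (hi - lo) or 1.0
    return [[round(255 * (v - lo) / span) for v in row] for row in raw]
```

Increasing `n_sites` yields smaller blocks and therefore a more detailed heightmap from the same handful of Gaussians, which mirrors the role of the Voronoi block count described above.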
However, directly using the Gaussian–Voronoi map to generate heightmaps causes some problems: in many situations, this generation method may lead to overly abrupt height changes. The root cause of this problem is the superposition of multiple Gaussians, which may create significant height differences in specific areas compared with surrounding areas.
To solve this problem, we adopted a simple and effective method, namely, directly applying Gaussian blur on the Gaussian–Voronoi map. Although there was initial concern that this might cause terrain features with large height differences, such as cliffs, to appear smooth, in practice, we found that the height differences shaped by these terrain features themselves were sufficient to offset the smoothing effect of the blur. At the same time, the height differences in exceptional areas are not easily affected by the blur, so this method has shown considerable robustness in practice.
Figure 3 shows the effect of using Gaussian blur.
Through this simple and ingenious processing method, we successfully mitigated the issue of overly abrupt height changes caused by the superposition of multiple Gaussians, improving the smoothness of the generated heightmaps while maintaining a reasonable representation of terrain features. This adjustment not only effectively enhanced the quality of the generated results but also ensured that important details of the original terrain were not lost in the blurring process.
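The smoothing step can be illustrated with a small separable Gaussian blur. The sketch below is a generic implementation in pure Python and is not taken from this work; the kernel radius of three standard deviations, the border clamping, and the default `sigma` are assumptions chosen for illustration.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel with a radius of about three sigma."""
    radius = max(1, int(3 * sigma))
    weights = [
        math.exp(-(i * i) / (2.0 * sigma * sigma))
        for i in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [wt / total for wt in weights]


def blur_1d(values, kernel):
    """Convolve one row or column with the kernel, clamping at the borders."""
    radius = len(kernel) // 2
    last = len(values) - 1
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, wt in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)  # repeat the edge value
            acc += wt * values[j]
        out.append(acc)
    return out


def gaussian_blur(grid, sigma=2.0):
    """Blur a 2D heightmap by filtering rows first and then columns."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(row, kernel) for row in grid]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

Applied to a heightmap containing a sharp step between two plateaus, the blur reduces the jump between neighboring pixels while leaving the plateau heights away from the step essentially unchanged, which is consistent with the observation that large height differences survive the smoothing.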
3.2. Chain-of-Thought Behavior Tree
Our goal is for large language models to process various types of text inputs and extract sufficient spatial information. Such inputs may directly request the generation of specific geographical structures at certain coordinates or indirectly describe the orientation and partial features of terrains. To achieve this, the model must perform a degree of reasoning on the text inputs. A common way to accurately interpret the terrain and spatial information in the input texts is to guide the large language model through a thought chain so that it reasons step by step according to a set logic, and extensive research has demonstrated the effectiveness of this approach. However, given the diversity of text inputs, guiding the large language model to interpret text in a single fixed pattern still cannot process all inputs accurately and effectively.
We developed a zero-shot guidance method, named the chain-of-thought behavior tree. This method combines thought chains and behavior trees to guide the large language model. It processes data using different reasoning logics for analyzing terrain feature information. For each input text, we engage the large language model in multiple rounds of interaction. In each round, we present guiding text in a fill-in-the-blank format. The model is tasked with completing these blanks based on the input text. The answers to these fill-in-the-blanks are then used as criteria in the behavior tree to determine which logical path the large language model should follow in subsequent guidance. To maintain continuity and consistency of context, in subsequent guiding inputs, we not only provide new guiding texts but also include the answers formed by the large language model in the previous judgment process, thus achieving a step-by-step reasoning process.
Figure 4 illustrates a simple model of the chain-of-thought behavior tree. Here, the large language model first checks if the input text includes data for constructing Gaussians. If so, the data are directly cached; otherwise, it begins to build data step-by-step. In the data construction process, to ensure the consistency of data measurement scales, we have the large language model first convert text information from diverse natural languages into specific geographical concepts, and then generate data from these concepts according to the given scale standards.
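The control flow described above can be sketched as a small tree of fill-in-the-blank prompts. The node names, prompt wording, and the `llm` callable are all hypothetical placeholders introduced for illustration; the sketch only shows how an answer selects the next branch and how earlier answers are carried into later prompts, and it does not reproduce the actual prompts used in this work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    """One behavior-tree node: a fill-in-the-blank prompt plus answer-keyed branches."""
    name: str
    template: str  # guiding text with a blank for the model to complete
    branches: Dict[str, "Node"] = field(default_factory=dict)
    default: Optional["Node"] = None  # followed when no branch key matches


def run_tree(root: Node, user_text: str, llm: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Walk the tree; each prompt carries the input text and all earlier answers."""
    history: List[Tuple[str, str]] = []
    node: Optional[Node] = root
    while node is not None:
        context = "\n".join(f"{name}: {answer}" for name, answer in history)
        prompt = f"Input: {user_text}\nPrevious answers:\n{context}\n{node.template}"
        answer = llm(prompt).strip()
        history.append((node.name, answer))
        # The answer decides which logical path is followed next.
        node = node.branches.get(answer.lower(), node.default)
    return history


# Hypothetical tree mirroring the simple model described for Figure 4.
TO_GAUSSIANS = Node(
    "to_gaussians",
    "At the given scale, the Gaussian parameters (x, y, h, w) for these concepts are: ____",
)
TO_CONCEPTS = Node(
    "to_concepts",
    "The geographical concepts described by the input are: ____",
    default=TO_GAUSSIANS,
)
CACHE_DATA = Node(
    "cache_data",
    "The Gaussian parameters given in the input are: ____",
)
ROOT = Node(
    "has_gaussian_data",
    "Does the input contain explicit Gaussian parameters? Answer yes or no: ____",
    branches={"yes": CACHE_DATA},
    default=TO_CONCEPTS,
)
```

Calling `run_tree(ROOT, text, llm)` with any callable that maps a prompt string to a completion returns the ordered list of (node name, answer) pairs, which also serves as a trace of the reasoning path that was taken.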
The chain-of-thought behavior tree method significantly enhances the large language model’s ability to refine data from diverse inputs when solving a single problem. Compared with a single thought chain, the behavior tree can dynamically adjust the reasoning path based on different features of the input, allowing the model to better adapt to different problems and scenarios. This is very important in our zero-shot scenario, because no single fixed reasoning path can be assumed to suit all types of inputs. At the same time, by providing a clear structure of behavior trees and execution of thought chains, we offer clear decision-making and reasoning paths, making the model’s decision process more transparent and traceable, thereby enhancing the interpretability of the model’s output. By combining the Gaussian–Voronoi map and chain-of-thought behavior tree, the model is endowed with sufficient spatial awareness to generate diverse high-quality initial terrains. In the next section, we will discuss how to use text to make secondary modifications to the initial terrain.
3.3. Multiagent Update Strategy
In our study, we introduce a refined methodology to enhance the 3D terrain generation process using a new multiagent strategy. This strategy stems from the recognition that a 3D map embodies an ensemble of distinct terrain features, which are depicted using Gaussian functions within our Gaussian–Voronoi map framework. Our approach aims to facilitate the meticulous manipulation of specific map regions through the calibration of existing Gaussian functions or the introduction of new ones.
Upon the initial rendering of the 3D landscape, we implement a hierarchical multiagent system to refine and perfect the terrain. The process commences with the submission of modification directives and Gaussian parameters to the overseeing agent of agents. This pivotal agent assesses the overall terrain configuration to decide whether to engage the add or modify strategy.
When the add strategy is activated, the agent of agents, utilizing a bifurcated chain-of-thought behavior tree, initiates the process by incorporating additional Gaussians into the extant map schema. On the other hand, the modify strategy initiates with each agent conducting a preliminary hypothetical modification of their assigned Gaussian, independent of other Gaussians. The initial phase, or the first branch of their dedicated chain-of-thought behavior tree, enables an isolated assessment of potential adjustments. It does so without the influence of neighboring terrain features. Following this provisional step, the agents embark on the second part of their evaluation, the second branch of the behavior tree. Here, they scrutinize the outcomes of the assumed modifications in the context of the collective Gaussian landscape. This comprehensive analysis enables agents to discern the interplay between their Gaussian and adjacent ones, leading to a more informed and precise calibration of the terrain. Such a sequential approach ensures that any Gaussian whose modified presence is virtually nonexistent—whose influence on the terrain’s relief is minimal to none—is systematically removed. This two-part chain-of-thought process not only refines individual features but also harmonizes the collective adjustments, resulting in an intricately detailed and cohesive 3D terrain.
Both strategies are operable in tandem, with precedence given to the add strategy. This precedence ensures that modifications applied during the modify phase can refine the Gaussians introduced in the add phase, resulting in a more accurate and textually compliant terrain outcome.
Figure 5 shows our agent workflow.
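The order of operations in this workflow can be sketched as follows. The function and parameter names are hypothetical, and the three callables (`supervisor`, `propose`, `reconcile`) stand in for large-language-model calls made by the agent of agents and by the per-Gaussian agents; the plan format and the `min_height` pruning threshold are assumptions made for illustration only.

```python
from typing import Callable, Dict, List

Gaussian = Dict[str, float]  # keys: "x", "y", "h", "w"


def update_terrain(
    gaussians: List[Gaussian],
    instruction: str,
    supervisor: Callable[[str, List[Gaussian]], dict],
    propose: Callable[[str, Gaussian], Gaussian],
    reconcile: Callable[[str, Gaussian, List[Gaussian]], Gaussian],
    min_height: float = 1.0,
) -> List[Gaussian]:
    """Apply the add strategy first, then the two-branch modify strategy."""
    # The agent of agents decides which strategies to engage.
    # Assumed plan format: {"add": [new Gaussians], "modify": bool}.
    plan = supervisor(instruction, gaussians)

    current = [dict(g) for g in gaussians]  # work on copies
    current.extend(dict(g) for g in plan.get("add", []))  # add takes precedence

    if plan.get("modify", False):
        # Branch 1: each agent drafts a change to its own Gaussian in isolation.
        drafts = [propose(instruction, g) for g in current]
        # Branch 2: each agent revisits its draft in the context of all others.
        current = [
            reconcile(instruction, draft, drafts[:i] + drafts[i + 1:])
            for i, draft in enumerate(drafts)
        ]
        # Remove Gaussians whose remaining influence is negligible.
        current = [g for g in current if abs(g["h"]) >= min_height]

    return current
```

Because newly added Gaussians are appended before the modify phase begins, they receive their own draft and reconciliation pass, which reflects the precedence of the add strategy described above.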
This solution introduces a secondary editing mechanism into our method, providing the ability to update and optimize the generated results, which is crucial for meeting specific requirements and detailed designs. Additionally, since the size of the Gaussians is controllable, this approach also offers fine-grained control over the terrain, allowing us, in principle, to refine 3D terrains to an arbitrarily fine level of detail. An apparent flaw in this method is that the same terrain feature represented by multiple Gaussians might be modified by multiple agents simultaneously, potentially leading to inconsistent data measurement scales. However, the chain-of-thought behavior tree already ensures that the large language model processes data at a consistent scale, so this issue does not require special consideration here.