3.1. Overview
In this section, we present a comprehensive overview of the proposed methodology, referred to as the Landscape framework, which is specifically designed to address structured learning problems that arise in complex decision-making and representational contexts. Our framework integrates modeling innovations and strategic algorithmic solutions to address fundamental challenges such as high-dimensional structural complexity, sequential ambiguity, and the presence of latent semantic relationships. Landscape, thus, aims to advance scalable and interpretable analysis in dynamic, data-rich environments.
Section 3.2 defines the core problem and sets up the notational framework for our approach. We formalize the target learning task as a structured prediction problem characterized by ambiguous observations and high-order dependencies across spatial and temporal dimensions. The primary objective is to learn a function or policy that maps partially observable states or high-dimensional inputs into target outputs, while preserving structural consistency embedded in the data.
To this end, we introduce the formal hypothesis space, the relevant latent state space where applicable, and the structural assumptions embedded in the data. We also frame a generalized decision process abstraction, such as a Partially Observable Markov Decision Process (POMDP) when applicable, which is crucial to understanding the underlying complexity of decision-aware representation learning. Notably, we refrain from decomposing the problem into a simplistic enumeration, instead presenting a unified mathematical formulation through a series of optimization-based definitions and operator-centric constructions. The Structural Model, described in
Section 3.3 and termed the Structured Cone Machine (SCM), presents a model-based design tailored to represent hierarchical decisions and abstract semantic states. Inspired by geometric representations such as simplex coding in multicategory SVMs and decision policies over partially observable environments, the SCM unifies the strengths of both statistical margin-based learning and structured geometric encoding.
In simpler terms, a Partially Observable Markov Decision Process (POMDP) is like making decisions in a foggy environment where one cannot see the entire picture at once. For instance, imagine an autonomous drone navigating through a city during heavy fog. It cannot directly observe all buildings, pedestrians, or vehicles, so it must rely on partial sensor readings (e.g., LiDAR scans, camera frames) to estimate what is really around it. The POMDP framework mathematically models this uncertainty by combining three elements: States (what is truly happening in the environment, even if hidden), Observations (what the system actually senses), and Actions (decisions made based on those observations). By continuously updating its belief about the hidden state, the system can make better decisions despite incomplete information.
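To make the belief-update idea concrete, the following minimal sketch implements the discrete Bayes filter that underlies POMDP belief tracking; the transition and observation matrices here are illustrative placeholders rather than models from our framework.

```python
import numpy as np

# Minimal discrete Bayes filter: the belief-update step behind POMDP reasoning.
# T[a][s, s'] = P(s' | s, a) and O[s', o] = P(o | s') are illustrative placeholders.
def update_belief(belief, action, observation, T, O):
    predicted = T[action].T @ belief              # predict: sum_s P(s'|s,a) b(s)
    unnormalized = O[:, observation] * predicted  # correct: weight by P(o|s')
    return unnormalized / unnormalized.sum()      # renormalize to a valid distribution

# Toy example: 2 hidden states, 2 actions, 2 observations.
T = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
O = np.array([[0.8, 0.2],   # P(o | s'=0)
              [0.3, 0.7]])  # P(o | s'=1)
belief = np.array([0.5, 0.5])
belief = update_belief(belief, action=0, observation=1, T=T, O=O)
print(belief)
```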
The Structured Cone Machine (SCM) takes this a step further by embedding these belief states into a geometric, cone-shaped representation. This conic embedding helps preserve both the hierarchy and the relationships between features in high-dimensional visual data. Think of it as projecting complex patterns, such as seasonal vegetation changes in satellite images, onto a structured shape where distances and angles have semantic meaning. This makes it easier for the model to recognize patterns across scales and contexts.
Adaptive Belief Optimization (ABO) ensures that the learning process adapts in real time as the model’s confidence changes. If the system is highly uncertain (e.g., detecting an unusual urban feature it has not seen before), ABO adjusts the optimization strategy to explore more possibilities. Conversely, when confidence is high, it focuses on refining the solution. This adaptability improves robustness, especially when handling seasonal shifts, mixed urban–rural regions, or noisy sensor data.
Combining POMDP reasoning, SCM’s geometric structuring, and ABO’s adaptive learning enables our framework to handle complex, high-dimensional image analysis tasks with both precision and resilience. This has direct applications in sustainable urban planning, disaster response mapping, and environmental monitoring, where decisions must be made quickly despite uncertain or incomplete data. To enhance interpretability for readers unfamiliar with the underlying mathematics, we include a summary diagram of the proposed SCM-ABO framework, illustrating how POMDP reasoning, conic geometric embedding, and belief-aware optimization interact to produce robust segmentation outputs.
Figure 1 shows the end-to-end flow of our framework, starting from an input image, passing through POMDP reasoning for decision-making under uncertainty, conic geometric embedding for structured feature representation, and belief-aware optimization for adaptive learning, culminating in the final segmented output.
The model introduces a conic embedding space with tractable optimization properties, ensuring that multiple structured classes or states can be jointly modeled without requiring hard combinatorial decompositions. The model is constructed to preserve both expressive capacity and computational feasibility. The derivation involves matrix representations, convex relaxations, and factorization with kernel alignment to achieve a balance between interpretability and discriminative power. In
Section 3.4, we introduce a strategy, referred to as Adaptive Belief Optimization (ABO), that supports effective training and deployment of the proposed SCM model under real-world constraints. ABO employs decision-theoretic foundations to optimize policy decisions in scenarios where labels may be noisy, rewards are sparse, or actions influence future states. The strategy builds upon reinforcement-style updates but extends them by incorporating a belief-aware regularization that dynamically adjusts the model’s internal representation of uncertainty and reliability. This section emphasizes how such a strategy allows SCM to be not only predictive but also adaptive, improving its robustness against domain shifts, occlusions, or compositional ambiguities. The strategy is developed in a streaming data setting, which calls for online learning principles and memory-efficient approximations.
3.2. Preliminaries
We begin by formalizing the structured learning problem underpinning the Landscape framework. The primary goal is to learn a predictive function or a decision policy in a structured output space, where both the input observations and the output configurations exhibit high-order dependencies, latent uncertainty, or partial observability. This section develops mathematical and notational scaffolding that supports the subsequent design of the model and strategy. Our formulation draws on tools from optimization theory, structured prediction, and decision processes.
Let $\mathcal{X}$ denote the input space, $\mathcal{Y}$ be a structured output space, and $\mathcal{Z}$ be an optional latent space used to encode hidden or unobserved semantic states. Let $(x_i, y_i)$ be an observed training example, with $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ for $i = 1, \dots, n$. The learning task is to construct a function $f: \mathcal{X} \to \mathcal{Y}$ (or, more generally, $f: \mathcal{X} \times \mathcal{Z} \to \mathcal{Y}$) that minimizes a risk functional over some structured loss $\ell$. The process begins with an input image I, which is processed by the backbone network to extract multi-scale feature representations. This operation is defined in Equation (1) and is visually depicted in Figure 3, Backbone Feature Extraction Module. The population risk is mathematically defined as
$$R(f) = \mathbb{E}_{(x, y) \sim \mathcal{P}}\bigl[\ell\bigl(f(x), y\bigr)\bigr],$$
where $\mathcal{P}$ denotes the joint distribution over inputs and structured outputs. Since $\mathcal{P}$ is unknown, we minimize the empirical risk with regularization, expressed as
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr) + \lambda\, \Omega(f),$$
where $\Omega(f)$ is a regularizer enforcing structure such as smoothness, sparsity, or margin constraints. The backbone features are transformed using the Conic Output Embedding module to obtain the conic embeddings, as given by the formulation in Equation (2). This operation corresponds to Figure 3, Conic Output Embedding Module. The embedding ensures that the feature space is structured for downstream alignment and prediction tasks. In structured output settings, the loss $\ell$ is not decomposable over individual labels or tokens. For instance, in sequence labeling tasks, one may use the Hamming loss, structured hinge loss, or negative log-likelihood over output sequences; the structured hinge, for instance, takes the form
$$\ell(x_i, y_i; w) = \max_{y' \in \mathcal{Y}} \Bigl[\Delta(y_i, y') + \bigl\langle w, \Phi(x_i, y') \bigr\rangle\Bigr] - \bigl\langle w, \Phi(x_i, y_i) \bigr\rangle,$$
where $\Phi(\cdot, \cdot)$ is a joint feature map and $\Delta(\cdot, \cdot)$ is a task-specific loss function. To model partial observability, we introduce a latent variable $z \in \mathcal{Z}$ representing hidden context, state, or alignment. Let the model decompose as $f(x) = \arg\max_{y \in \mathcal{Y}} \max_{z \in \mathcal{Z}} s_{\theta}(x, y, z)$, where $s_{\theta}$ is a latent-conditioned scoring function with parameters $\theta$ to be learned. This formulation gives rise to a latent structured prediction problem, requiring joint inference over $(y, z)$. In Landscape, structured output configurations are encoded in a conic representation inspired by simplex geometry, formulated as a codebook of unit vectors satisfying
$$\langle c_k, c_{k'} \rangle = \begin{cases} 1, & k = k', \\ -\tfrac{1}{K-1}, & k \neq k', \end{cases} \qquad \sum_{k=1}^{K} c_k = \mathbf{0},$$
where $c_1, \dots, c_K$ form a simplex code. The model learns to align predictions with the conic directions associated with correct outputs. The above formalism serves to establish a rigorous mathematical foundation for the Landscape method. The formulations reflect the complexities of learning in structured, partially observable, and high-dimensional environments. In the following section, we construct a model based on the Structured Cone Machine that operationalizes this formalism and addresses the computational and representational challenges posed by the structure of $\mathcal{Y}$ and $\mathcal{Z}$.
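As a concrete illustration of the regularized empirical risk and the margin-rescaled structured hinge loss formalized above, the following sketch uses a toy joint feature map and a 0/1 task loss; both are illustrative stand-ins, not the feature maps used in Landscape.

```python
import numpy as np

# Sketch of the regularized empirical risk with a margin-rescaled structured hinge
# loss, matching the formulation above. phi and delta are toy stand-ins for the
# joint feature map Phi(x, y) and the task loss Delta(y, y') of a real structured task.
def phi(x, y, num_classes):                 # joint feature map: class-wise stacking
    f = np.zeros(num_classes * x.size)
    f[y * x.size:(y + 1) * x.size] = x
    return f

def delta(y, y_pred):                       # task loss: 0/1 loss here
    return float(y != y_pred)

def structured_hinge(w, x, y, num_classes):
    scores = [delta(y, yp) + w @ phi(x, yp, num_classes) for yp in range(num_classes)]
    return max(scores) - w @ phi(x, y, num_classes)   # loss-augmented inference

def empirical_risk(w, X, Y, num_classes, lam=0.1):
    losses = [structured_hinge(w, x, y, num_classes) for x, y in zip(X, Y)]
    return np.mean(losses) + lam * np.dot(w, w)       # hinge risk + L2 regularizer Omega

rng = np.random.default_rng(0)
X = [rng.normal(size=4) for _ in range(8)]
Y = [int(rng.integers(0, 3)) for _ in range(8)]
w = rng.normal(size=3 * 4)
print(empirical_risk(w, X, Y, num_classes=3))
```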
3.3. Structured Cone Machine (SCM)
To improve the coherence and readability of our framework, we provide an integrated schematic diagram that illustrates the relationships among the Structured Cone Machine (SCM), Adaptive Belief Optimization (ABO), Latent Feature Alignment, and Kernel/Multiview Extensions. This overview contextualizes the interaction of key components in our model pipeline (as shown in
Figure 2). We introduce the Structured Cone Machine (SCM), a predictive model that combines geometric conic embeddings with latent structure and convex optimization. The SCM is designed to handle structured output tasks under ambiguity and high-order constraints, with theoretical guarantees and interpretability (as shown in
Figure 3). This explicitly clarifies how the upper pipeline relates to the three detailed modules shown below, while also defining all symbols and legend items.
Figure 3 now has a clear two-level organization: The top row depicts the full encoder–decoder pipeline, whereas the bottom panels illustrate three functional modules in detail. Specifically, Block A corresponds to the
Sparse Self-Attention (SSA, left, green panel), Block B corresponds to the
Dilformer/Latent Feature Alignment (LFA, center, peach panel), and Block C corresponds to the
Dilated Self-Attention (DSA, right, green panel). Each “Conic Output Embeddings” box along the pipeline invokes one of these modules at the corresponding spatial scale.
In the encoder (downward path), the full-resolution feature maps are first processed by Block A (SSA), applied repeatedly, followed by Block B (Dilformer/LFA), also applied repeatedly, and are then progressively downsampled to coarser resolutions, where Block C (DSA) is applied with its own repeat counts to capture multi-rate context. In the decoder (upward path), features at the intermediate resolutions are processed by Block B and Block A, respectively, before returning to the full-resolution map. A final convolution then produces the de-rained output. The labels in the figure indicate the number of repeated applications of the corresponding module at each scale.
The functionality of each module is explicitly described in the figure and text. Block A (SSA) constructs the query, key, and value (Q, K, V) using three dilated convolutions with dilation rates of up to 2, followed by projections and reshaping. These yield sparse attention maps, which gate the input through element-wise multiplication (⊗), with a residual connection added via (⊕). Block B (Dilformer/LFA) applies Layer Normalization, Latent Feature Alignment, and Kernel/Multiview Extensions, thereby aligning intermediate features with the conic output space. Block C (DSA) consists of stacked dilated convolutions with dilation rates of 1, 2, and 3, each followed by a projection and ReLU activation, with results aggregated by residual addition (⊕) and concatenation. Shapes along the top pipeline are explicitly annotated, ensuring that dimensional changes are traceable at every stage. Importantly, each “Conic Output Embeddings” box denotes projection into the conic label space, as described in Section 3.2, so that all scales consistently employ the same geometric coding before and after the attention modules.
The caption also defines all icons: Circled ⊕ denotes element-wise addition, circled ⊗ denotes element-wise multiplication, a dedicated symbol represents convolution, and the “Concatenate” symbol indicates channel concatenation; the “dil-conv (rate = r)” tiles specify dilated convolutions with dilation factor r. Explicit cross-references link the equations and the named modules (SSA, Dilformer/LFA, DSA) at the points where their mathematical formulations are presented, thereby removing any ambiguity between the overall pipeline and the detailed module diagrams.
In
Figure 3,
SSA denotes the
Sparse Self-Attention module, which captures long-range dependencies while reducing computational overhead via sparsity constraints.
DSA refers to the
Dilated Self-Attention module, which enlarges the receptive field through dilation rates of 1, 2, and 3, thereby integrating multi-scale contextual information.
Dil-Conv represents
Dilated Convolution, where the dilation rate
r controls the spacing between kernel elements to balance resolution preservation and contextual coverage.
Each architectural block in the schematic is labeled and color-coded, with functional roles described directly in the legend: The green-shaded block depicts the SSA process (including query, key, and value generation, sparse attention map computation, and residual addition), the orange-shaded block illustrates the Dilformer (Layer Normalization, Latent Feature Alignment, and Kernel/Multiview Extensions), and the light-green block represents the DSA module (stacked dilated convolutions and ReLU activations). Additionally, arrows indicate the explicit data flow between modules, and icons for operations (element-wise addition, multiplication, and concatenation) are annotated in the legend. These annotations ensure that the architecture, its components, and the data processing pipeline can be fully understood without requiring the reader to cross-reference earlier sections. This figure illustrates the step-by-step flow of the proposed system for processing an input image. First, the image passes through the initial feature extraction layers, where low-level visual patterns such as edges, textures, and colors are captured.
These extracted features are then fed into Block A (feature enhancement), which enriches and amplifies important spatial details. Next, the output from Block A enters Block B (context aggregation), which gathers multi-scale contextual information to improve the understanding of global and local structures. The refined features are subsequently processed by Block C (prediction refinement), which fine-tunes the feature maps to produce accurate and high-quality predictions. Finally, the output is passed through the final layers to generate the desired result, with each block seamlessly integrated into the pipeline to ensure smooth information flow from input to output.
In structured classification tasks where relational geometry between output classes matters—such as taxonomy prediction, morphological categorization, or semantic segmentation—the Structured Cone Machine (SCM) replaces traditional one-hot encodings with geometrically structured output embeddings (as shown in
Figure 4). In the proposed architecture,
H and
W denote the height and width of the input feature map (in pixels), respectively, while
C represents the number of channels. The processing begins with Layer Normalization (LN), which standardizes the input feature distribution to enhance training stability, followed by a linear projection layer (
Project) that maps features into the target embedding dimension
d. Sparse Self-Attention (SSA) is then applied to focus computation on the most informative regions by ignoring irrelevant positions, whereas Dilated Self-Attention (DSA) expands the receptive field via dilation rates to capture broader contextual information. A Rectified Linear Unit (ReLU) introduces non-linearity by suppressing negative activations, and learnable weight parameters
W (yellow blocks in the figure) are employed for feature combination or transformation. Attention scores are normalized using the
Softmax function to obtain probability distributions over positions. Finally, the Cone Output Embedding assigns each class a unique conic direction, enabling classification based on angular separation in the embedding space.
Figure 4 depicts the processing pipeline of the Structured Cone Machine (SCM) with Conic Output Embeddings. The input image of dimensions H × W × C is first normalized using Layer Normalization (LN) to stabilize feature distributions. These normalized features are projected into a lower-dimensional representation through a linear projection layer. The projected features are then processed by two attention mechanisms: Sparse Self-Attention (SSA) and Dilated Self-Attention (DSA). SSA selectively focuses on critical regions by introducing sparsity into the attention computation, while DSA expands the receptive field via dilation, enabling multi-scale context capture. The results are further transformed by a ReLU activation function and integrated through learned weights (W) at multiple points in the network. Finally, features are realigned using the Cone Output Embedding mechanism, where each class is represented by a distinct conic direction in the embedding space, enabling angular decision boundaries for classification. The final projection produces the structured output predictions.
Each class label $k \in \{1, \dots, K\}$ is assigned to a conic direction $c_k$, chosen such that all class vectors lie on a unit-radius spherical simplex centered at the origin. This encoding enforces angular separation between class labels, capturing mutual exclusivity while preserving geometric uniformity. Specifically, the conic codebook satisfies
$$\langle c_k, c_{k'} \rangle = \begin{cases} 1, & k = k', \\ -\tfrac{1}{K-1}, & k \neq k', \end{cases} \qquad \sum_{k=1}^{K} c_k = \mathbf{0}.$$
This formulation, as outlined in the above equation, ensures that all codes are equidistant in angle, and their sum vanishes, forming a zero-mean, maximally discriminative embedding basis. Such a conic output structure is particularly advantageous when class similarities or transitions must be interpreted geometrically, such as in gradual evolutionary categories or skill acquisition levels.
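The following sketch shows one standard construction of a codebook satisfying the constraints above (Equation (6)): unit-norm class directions with pairwise inner product −1/(K − 1) and zero mean. It is offered as an illustrative construction; the framework only requires that the codebook satisfy these properties.

```python
import numpy as np

# One standard construction of a simplex code satisfying Equation (6):
# unit-norm class vectors, pairwise inner product -1/(K-1), and zero mean.
def simplex_codebook(K):
    # c_k = sqrt(K/(K-1)) * (e_k - 1/K * 1); vectors live in R^K but span a (K-1)-dim subspace.
    E = np.eye(K)
    C = np.sqrt(K / (K - 1)) * (E - np.ones((K, K)) / K)
    return C  # row k is the conic direction c_k

C = simplex_codebook(4)
print(np.round(C @ C.T, 3))        # 1 on the diagonal, -1/(K-1) = -0.333 elsewhere
print(np.round(C.sum(axis=0), 6))  # zero-mean constraint: sum_k c_k = 0
```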
To enhance the interpretability of Equation (
6), we introduce a visual guide in
Figure 5 that demonstrates the principle of conic separation. In this arrangement, each class embedding is mapped to a unique directional vector radiating from the origin, with equal angular spacing between them. This configuration forms a set of non-overlapping conic regions in the embedding space, enabling clear class discrimination while preserving geometric consistency. By visualizing the embeddings in this manner, the concept becomes accessible even to readers unfamiliar with the underlying cone geometry.
Figure 5 visually demonstrates conic separation in a high-dimensional embedding space, illustrating how class labels are mapped to distinct, uniformly spaced conic directions on the surface of a unit-radius sphere. Each cone represents a class vector oriented at a fixed angular distance from others, ensuring mutual exclusivity while maintaining geometric uniformity. The symmetry reflects the property from Equation (
6), where the dot product between identical class vectors equals 1, while between different class vectors it equals a constant negative value, ensuring equal angular spacing. The convergence of cone bases at the origin symbolizes the zero-mean constraint $\sum_{k=1}^{K} c_k = \mathbf{0}$, which balances the embedding space and prevents bias toward any class direction. This spatial structure facilitates clear class discrimination while enabling the smooth interpretation of transitions between similar classes.
Furthermore, the summation constraint $\sum_{k=1}^{K} c_k = \mathbf{0}$ enforces a balanced, zero-mean embedding space. This ensures that the embedding vectors are symmetrically distributed around the origin, preventing directional bias toward any specific class. Such balance is critical for maintaining stability in the optimization process and improving the model’s ability to generalize across diverse datasets. For an input sample $x$, a feature mapping $\phi(x)$ is computed using a deep encoder or kernelized transformation, followed by a linear scoring layer that projects the input into the conic output space as specified:
$$f(x) = W\phi(x) + b.$$
Here, $W$ and $b$ define the learnable projection to the output cone. The model predicts the class whose conic direction has the highest inner product with the projected feature vector, $\hat{y} = \arg\max_{k} \langle f(x), c_k \rangle$, effectively performing angle-based classification in the embedding space. Unlike softmax-based models, which rely on unstructured logits, this approach yields geometrically meaningful decision boundaries. To train SCM, a margin-based structured hinge loss is employed. For a given training example $(x_i, y_i)$ with ground-truth label $y_i$, the loss encourages the prediction vector $f(x_i)$ to align more closely with the correct conic direction $c_{y_i}$ than with any other $c_k$, $k \neq y_i$, by at least a margin $\gamma$:
$$\mathcal{L}_{\mathrm{cone}}(x_i, y_i) = \sum_{k \neq y_i} \max\Bigl(0,\; \gamma - \bigl\langle f(x_i), c_{y_i} \bigr\rangle + \bigl\langle f(x_i), c_k \bigr\rangle\Bigr).$$
This hinge-style formulation penalizes all incorrect classes that are too close in angle to the predicted direction, promoting sharp angular separation and encouraging the model to maintain strong confidence margins. To prevent overfitting and enhance generalization, a conic norm regularizer is applied to both the projection weights and the output predictions, as given by
$$\mathcal{R}_{\mathrm{cone}} = \lambda_1 \|W\|_F^2 + \lambda_2 \sum_{i=1}^{n} \|f(x_i)\|_2^2,$$
where $\|\cdot\|_F$ is the Frobenius norm and $\lambda_1$ and $\lambda_2$ control the weight decay and output energy, respectively. This regularization ensures that the classifier respects the geometric assumptions of the conic embedding space. Moreover, to better handle ambiguous samples near class boundaries, an angular temperature scaling can be introduced during inference to convert directional scores into probabilistic confidence levels by employing
$$p(y = k \mid x) = \frac{\exp\bigl(\langle f(x), c_k \rangle / \tau\bigr)}{\sum_{k'=1}^{K} \exp\bigl(\langle f(x), c_{k'} \rangle / \tau\bigr)},$$
where $\tau$ is a temperature parameter controlling the sharpness of the prediction distribution. Lower $\tau$ emphasizes the winner-take-all decision, while higher $\tau$ smooths predictions across neighboring classes, which is beneficial for hierarchically related outputs. Overall, the SCM framework provides a structured geometric alternative to conventional classification, aligning latent predictions with interpretable directional codes. Its design is particularly suited for tasks requiring relational inductive bias, uncertainty-aware outputs, or interpretable transitions between adjacent classes, and integrates seamlessly with deep neural backbones for end-to-end training.
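The sketch below ties together the linear conic scoring, the per-class margin hinge loss, and the temperature-scaled inference described above; the encoder features, cone directions, and hyperparameters are illustrative placeholders.

```python
import numpy as np

# Sketch of the SCM scoring and training signals described above: a linear projection
# into the conic space, the per-class margin hinge loss, and temperature-scaled
# inference. phi_x stands in for encoder features; gamma and tau are illustrative.
def conic_scores(phi_x, W, b, C):
    f = W @ phi_x + b                      # projection into the conic output space
    return C @ f                           # inner products <f(x), c_k> for every class k

def conic_hinge_loss(scores, y, gamma=0.5):
    margins = gamma - scores[y] + np.delete(scores, y)
    return np.maximum(0.0, margins).sum()  # penalize every class within the margin of c_y

def temperature_softmax(scores, tau=1.0):
    z = (scores - scores.max()) / tau      # stabilized softmax over directional scores
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
K, d, p = 4, 3, 16                         # classes, cone dimension, feature dimension
C = rng.normal(size=(K, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)  # placeholder unit directions; a simplex code would be used in practice
W, b, phi_x = rng.normal(size=(d, p)) * 0.1, np.zeros(d), rng.normal(size=p)
s = conic_scores(phi_x, W, b, C)
print("prediction:", int(np.argmax(s)))
print("hinge loss:", conic_hinge_loss(s, y=2))
print("confidence:", np.round(temperature_softmax(s, tau=0.5), 3))
```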
To handle inherent ambiguities in structured input–output mappings, the Structured Cone Machine (SCM) introduces a latent variable framework wherein each input $x$ is associated with an unobserved configuration $z \in \mathcal{Z}$ that modulates its feature embedding. These latent variables represent interpretable factors such as viewpoint, context, or missing modality, and are incorporated into the model through the composite feature map $\phi(x, z)$. The prediction function is, thus, extended to be conditioned on latent structure as follows:
$$f(x, z) = W\phi(x, z) + b,$$
where $W$ and $b$ are model parameters shared across all latent configurations. During training, for each input–label pair $(x_i, y_i)$, the model evaluates the structured margin loss over all possible latent states and selects the one that yields the highest hinge violation, formalized as
$$\mathcal{L}_{\mathrm{lat}}(x_i, y_i) = \max_{z \in \mathcal{Z}} \sum_{k \neq y_i} \max\Bigl(0,\; \gamma - \bigl\langle f(x_i, z), c_{y_i} \bigr\rangle + \bigl\langle f(x_i, z), c_k \bigr\rangle\Bigr),$$
with $\gamma$ denoting the desired conic margin. This loss promotes latent structures that best explain the discrepancy between the current prediction and the ground-truth label. To optimize this objective efficiently, we adopt an alternating strategy: In the E-step, the best latent configuration $z_i^{*}$ is selected via enumeration or greedy heuristics; in the M-step, the model parameters $W$ are updated to minimize the empirical risk given the fixed $z_i^{*}$. Before prediction, the composite feature map undergoes a series of intermediate transformations, such as attention weighting, normalization, and residual connections. These operations are captured by Equation (12) and correspond to the sub-blocks within Module C of Figure 3, where the feature refinement occurs. Furthermore, to prevent overfitting to spurious latent states, we introduce an entropy-based regularization that encourages distributional smoothness over $z$:
$$\mathcal{H}\bigl(q(z \mid x_i)\bigr) = -\sum_{z \in \mathcal{Z}} q(z \mid x_i)\, \log q(z \mid x_i),$$
thereby promoting latent diversity and robustness. For settings with annotated or weakly labeled latent variables, the optimization can be further guided via a KL-divergence alignment that is represented as
$$\mathcal{L}_{\mathrm{KL}} = \mathrm{KL}\bigl(\tilde{p}(z \mid x)\,\|\,q(z \mid x)\bigr),$$
where $\tilde{p}(z \mid x)$ is the supervision-derived distribution and $q(z \mid x)$ is the model-inferred posterior. The final training objective becomes
$$\mathcal{L}_{\mathrm{SCM}} = \sum_{i=1}^{n} \mathcal{L}_{\mathrm{lat}}(x_i, y_i) - \beta_1 \sum_{i=1}^{n} \mathcal{H}\bigl(q(z \mid x_i)\bigr) + \beta_2\, \mathcal{L}_{\mathrm{KL}},$$
where $\beta_1$ and $\beta_2$ are regularization hyperparameters. This latent alignment approach significantly enhances the SCM’s ability to disambiguate structured outputs under partial observability or missing-modality conditions, while maintaining interpretability and generalization.
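The following toy sketch illustrates the alternating scheme under the assumption of a small, enumerable latent set: the E-step keeps the latent configuration with the largest hinge violation and the M-step takes a subgradient step on W. The composite feature map used here is a placeholder, not the one employed by SCM.

```python
import numpy as np

# Toy alternating scheme for the latent-variable SCM: the E-step enumerates a small
# latent set Z and keeps the configuration with the largest hinge violation; the
# M-step takes a subgradient step on W for that configuration. The composite feature
# map phi(x, z) below is an assumed placeholder (a latent-dependent shift of x).
def phi(x, z):
    return x + 0.1 * z                                    # assumed composite feature map

def violation(W, C, x, z, y, gamma=0.5):
    s = C @ (W @ phi(x, z))                               # conic scores under latent z
    return np.maximum(0.0, gamma - s[y] + np.delete(s, y)).sum()

def latent_scm_step(W, C, x, y, Z, lr=0.05, gamma=0.5):
    z_star = max(Z, key=lambda z: violation(W, C, x, z, y, gamma))   # E-step
    s = C @ (W @ phi(x, z_star))
    grad = np.zeros_like(W)
    for k in range(C.shape[0]):                           # M-step: subgradient of the hinge
        if k != y and gamma - s[y] + s[k] > 0:
            grad += np.outer(C[k] - C[y], phi(x, z_star))
    return W - lr * grad, z_star

rng = np.random.default_rng(1)
K, d, p = 3, 3, 5
C = rng.normal(size=(K, d)); C /= np.linalg.norm(C, axis=1, keepdims=True)
W = rng.normal(size=(d, p)) * 0.1
x, y, Z = rng.normal(size=p), 1, [rng.normal(size=p) for _ in range(4)]
for _ in range(10):
    W, z_star = latent_scm_step(W, C, x, y, Z)
print("violation after updates:", violation(W, C, x, z_star, y))
```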
The Structured Cone Machine (SCM) framework extends beyond linear models by incorporating both kernelized and multiview feature representations, thereby enabling flexible, non-linear, and modality-aware structured prediction. These extensions preserve the conic embedding geometry while broadening the SCM’s applicability to heterogeneous data sources and complex decision boundaries. In the kernelized variant, SCM lifts the input $x$ into a reproducing kernel Hilbert space (RKHS) via a kernel function $k(\cdot, \cdot)$, avoiding the need for explicit feature mapping. The classifier function is expressed in dual form as
$$f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x) + b,$$
where the $\alpha_i$ are (vector-valued) dual coefficients optimized during training, and $k(\cdot, \cdot)$ is a positive-definite kernel such as the RBF, polynomial, or graph kernel, depending on the input modality. This representation allows SCM to capture non-linear relationships between inputs while preserving the directional decision rule inherent in the conic embedding. The classifier, thus, generalizes to complex manifolds without sacrificing the geometric interpretability or margin-based separation of its outputs. In applications involving multimodal or multiview data—such as audio-visual classification, multi-angle morphology analysis, or structured sensor fusion—the SCM supports view-specific encodings by learning separate embeddings for each modality or feature group. Given $V$ distinct views or channels, each with its own feature extractor $\phi_v(\cdot)$, the conic predictor is defined as
$$f(x) = \sum_{v=1}^{V} W_v\, \phi_v\bigl(x^{(v)}\bigr) + b,$$
where the $W_v$ are projection weights for view $v$, and $b$ is a shared bias vector. This formulation enables the SCM to model decomposable feature spaces where different views contribute complementary information, such as shape vs. texture, or kinematics vs. force. The training objective jointly optimizes all $W_v$ under the structured hinge loss while preserving shared conic output space alignment, promoting cooperative learning across modalities. To balance the contributions of each view and enforce consistency across latent representations, we introduce a view regularization term expressed as
$$\mathcal{R}_{\mathrm{view}} = \sum_{v=1}^{V} \mu_v \|W_v\|_F^2 + \eta \sum_{v < v'} \bigl\| f_v(x) - f_{v'}(x) \bigr\|_2^2,$$
where $f_v(x)$ is the partial prediction from view $v$, $\mu_v$ controls view-specific weight decay, and $\eta$ penalizes disagreement between views. This encourages aligned representations across modalities and ensures that the final prediction is not dominated by any single source, which is critical in noisy or partially missing data scenarios. The refined features are passed to the Belief-Aware Conic Prediction module, which produces the preliminary prediction map. The prediction process is governed by Equation (18), which defines the conic mapping, belief weighting, and prediction fusion. This corresponds to Figure 3, Belief-Aware Conic Prediction Module. In the kernelized multiview setting, each view may have its own kernel $k_v(\cdot, \cdot)$, allowing even greater flexibility, represented as
$$f(x) = \sum_{v=1}^{V} \sum_{i=1}^{n} \alpha_i^{(v)}\, k_v\bigl(x_i^{(v)}, x^{(v)}\bigr) + b,$$
thereby allowing view-specific similarity metrics. This is particularly useful in heterogeneous data regimes where each modality may follow different statistical properties. The SCM’s convexity in the dual and compatibility with conic embedding constraints enable efficient optimization using second-order solvers or projected subgradient methods. This makes the SCM both scalable and theoretically grounded, with strong generalization guarantees. Through the Kernel and Multiview Extensions, the SCM bridges geometric output structures with nonlinear and multimodal input representations, enabling high-fidelity, interpretable, and robust predictions across a wide range of structured learning tasks.
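As a hedged illustration of the multiview predictor and its kernelized dual form, the sketch below combines per-view linear projections into a shared conic output and evaluates a per-view RBF kernel expansion; the view features, dual coefficients, and kernels are toy placeholders.

```python
import numpy as np

# Sketch of the multiview conic predictor described above: each view v contributes
# W_v phi_v(x) to a shared conic output, and a per-view RBF kernel expansion gives
# a kernelized dual-form alternative. All features and coefficients are toy values.
def multiview_predict(view_feats, W_views, b, C):
    f = sum(W @ phi for W, phi in zip(W_views, view_feats)) + b   # shared conic output
    return int(np.argmax(C @ f))                                  # angle-based decision

def rbf_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def kernelized_score(x_views, train_views, alphas, kernels, b):
    # Dual form: f(x) = sum_v sum_i alpha_i^(v) k_v(x_i^(v), x^(v)) + b,
    # with vector-valued dual coefficients alpha_i^(v) in the conic output space.
    f = b.copy()
    for v, k_v in enumerate(kernels):
        for x_i, a_i in zip(train_views[v], alphas[v]):
            f += a_i * k_v(x_i, x_views[v])
    return f

rng = np.random.default_rng(0)
K, d, dims = 3, 3, [4, 6]                       # classes, cone dim, per-view feature dims
C = rng.normal(size=(K, d)); C /= np.linalg.norm(C, axis=1, keepdims=True)
W_views = [rng.normal(size=(d, p)) * 0.1 for p in dims]
b = np.zeros(d)
x_views = [rng.normal(size=p) for p in dims]
print("multiview prediction:", multiview_predict(x_views, W_views, b, C))

train_views = [[rng.normal(size=p) for _ in range(5)] for p in dims]
alphas = [[rng.normal(size=d) * 0.1 for _ in range(5)] for _ in dims]
f = kernelized_score(x_views, train_views, alphas, [rbf_kernel, rbf_kernel], b)
print("kernelized prediction:", int(np.argmax(C @ f)))
```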
3.4. Adaptive Belief Optimization (ABO)
To enhance the Structured Cone Machine (SCM) under uncertainty, we propose Adaptive Belief Optimization (ABO)—a dynamic framework that integrates belief reasoning, action control, and latent uncertainty into model adaptation (as shown in
Figure 6).
In the illustrated framework, one input denotes the feature representation derived from the reference or primary input stream, while the other denotes the feature representation from the target or secondary input stream. The Structured Memory and Belief Updates module stores and updates feature memory states while incorporating belief information to maintain temporal and contextual consistency. The Action Policy Over Beliefs serves as a decision-making mechanism that selects optimal actions based on the current state of learned beliefs. The Belief-Aware Conic Prediction module projects features into a conic embedding space, enabling the angular separation of classes for more robust classification. Fusion denotes the process of merging multiple feature maps to enhance their discriminative power. The Detection Head is a task-specific prediction module responsible for generating detection outputs, such as bounding boxes or class scores. The terms V, K, and Q correspond to value, key, and query features, respectively, which are employed in correlation and attention mechanisms. The normalized correlation map is a similarity matrix indicating the degree of alignment between features from two sources after normalization. The parameters a and b are learnable scaling factors that modulate the influence of feature maps, while σ represents the sigmoid activation function that normalizes values between 0 and 1.
Figure 6 illustrates a belief-aware detection framework that integrates structured memory updates with conic prediction for improved decision-making. Initially, the two feature representations undergo Structured Memory and Belief Updates, which refine and store contextual cues over time. These updated features are processed by two core modules: Action Policy Over Beliefs, which guides decision-making based on belief states, and Belief-Aware Conic Prediction, which aligns outputs along angular decision boundaries for structured predictions. The resulting outputs are aggregated through additive operations and sent either directly to individual Detection Heads or combined through a Fusion step to generate a fused feature representation, which is then processed by the final Detection Head. A detailed block at the bottom of the figure shows the correlation computation process, where value (V), key (K), and query (Q) features are used to generate a normalized correlation map. This map undergoes further transformations, scaling (a, b), and convolutional operations before producing enhanced feature maps for downstream detection.
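The exact wiring of the correlation block is specific to Figure 6; the sketch below encodes one plausible reading in which query–key similarities are softmax-normalized into a correlation map, used to aggregate the value features, and gated by a sigmoid with learnable scale a and shift b.

```python
import numpy as np

# A plausible reading of the correlation block in Figure 6 (the exact wiring is
# figure-specific and assumed here): query/key similarities are normalized into a
# correlation map, used to aggregate the value features, and the result is gated
# by a sigmoid with learnable scale a and shift b.
def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def correlation_fusion(Q, K, V, a=1.0, b=0.0):
    corr = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))        # normalized correlation map
    aggregated = corr @ V                                 # feature aggregation
    gate = 1.0 / (1.0 + np.exp(-(a * aggregated + b)))    # sigma(a*x + b) gating
    return gate * aggregated, corr

rng = np.random.default_rng(0)
N, d = 6, 8                        # number of spatial positions, feature dimension
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
enhanced, corr = correlation_fusion(Q, K, V, a=0.8, b=0.1)
print(corr.shape, enhanced.shape)  # (6, 6) correlation map, (6, 8) enhanced features
```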
In dynamic or partially observable settings, predictive uncertainty over latent factors must be explicitly modeled to improve decision robustness (as shown in
Figure 7). Feature extraction refers to the process of converting raw data into meaningful numerical representations that encode spectral, spatial, and temporal information. The
Spectral Transformer Block learns frequency-domain relationships by analyzing variations in signal power across different spectral components. The
Spatial Transformer Block captures positional and structural dependencies within spatial layouts of features. The
Temporal Attention Block selectively focuses on temporally significant patterns by computing weighted feature aggregations over time. The
Transformer Block implements multi-head self-attention with normalization and feed-forward layers to enhance feature integration.
Belief-Aware Conic Prediction projects features into a conic embedding space, increasing angular separation between classes for improved discriminability. The
Fully Connected (FC) Layer aggregates processed features into a fixed-length vector for classification. In the classification stage, predicted outcomes fall into categories such as positive, neutral, or negative.
Positional Encoding embeds sequential order information into features, enabling temporal modeling. Within the
Temporal Attention Block,
Weighted Sum,
Softmax, and
Linear layers work together to assign, normalize, and transform attention scores for optimal temporal feature selection.
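A minimal version of the Temporal Attention Block's Linear–Softmax–Weighted Sum pattern is sketched below; the dimensions and parameters are illustrative only.

```python
import numpy as np

# Minimal version of the Temporal Attention Block described above: a linear layer
# scores each time step, softmax normalizes the scores, and a weighted sum pools
# the sequence into a single feature vector. Dimensions are illustrative.
def temporal_attention(features, w_score, b_score):
    scores = features @ w_score + b_score          # Linear: one attention score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # Softmax over the temporal axis
    return weights @ features, weights             # Weighted Sum of frame features

rng = np.random.default_rng(0)
T, d = 12, 32                                      # time steps, feature dimension
features = rng.normal(size=(T, d))
pooled, attn = temporal_attention(features, rng.normal(size=d) * 0.1, 0.0)
print(pooled.shape, np.round(attn.sum(), 3))       # (32,) pooled vector, weights sum to 1
```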
The illustrated framework depicts a belief-aware detection and classification pipeline for spatiotemporal feature learning. Initially, raw data undergo feature extraction, producing multi-channel spectral–spatial–temporal representations. These are processed sequentially through the Spectral Transformer Block, which captures frequency-domain dependencies, and the Spatial Transformer Block, which models spatial relationships among features. The Temporal Attention Block then emphasizes relevant temporal patterns across frames by assigning adaptive attention weights. Features are further refined through a Fully Connected (FC) Layer before being classified into categories such as positive, neutral, or negative. In parallel, a Belief-Aware Conic Prediction module transforms extracted features into a conic embedding space to enhance class separability, ensuring robust predictions.
Figure 7 is an integral part of the proposed Belief-Aware Conic Prediction framework. It visually illustrates the incorporation of a belief distribution $b(z)$ over latent structures $z \in \mathcal{Z}$, such as pose configurations, sensory alignment, or context states. Rather than committing to a single latent configuration, the framework computes the belief-weighted model response by marginalizing the latent-conditioned outputs of the Structured Cone Machine (SCM), as mathematically defined in Equation (18). This figure directly corresponds to the Belief-Aware Conic Prediction Module in the proposed pipeline and is not merely illustrative, but a core representation of the methodology.
The Belief-Aware Conic Prediction framework incorporates a belief distribution $b(z)$ over latent structures $z \in \mathcal{Z}$, such as pose configurations, sensory alignment, or context states. Rather than committing to a single latent configuration, we define the belief-weighted model response by marginalizing the latent-conditioned outputs of the Structured Cone Machine (SCM), described as
$$\hat{y}(x) = \arg\max_{k} \Bigl\langle \sum_{z \in \mathcal{Z}} b(z)\, f(x, z),\; c_k \Bigr\rangle.$$
Here, $f(x, z)$ denotes the SCM’s latent-parameterized conic prediction, and $c_k$ is the canonical vector representing class $k$ in the conic embedding space. The posterior predictive decision, thus, reflects both the uncertainty in latent states and the structured class separation. We align the training objective with this perspective by minimizing the expected conic margin loss under the belief distribution:
$$\mathcal{L}_{\mathrm{belief}}(x, y) = \mathbb{E}_{z \sim b(z)}\bigl[\ell_{\mathrm{cone}}\bigl(f(x, z), y; W\bigr)\bigr],$$
where $\ell_{\mathrm{cone}}$ is the structured hinge loss over conic directions, and $W$ is the SCM weight matrix. To obtain the belief distributions $b(z)$, we use Bayesian updates based on observed signals or auxiliary outcomes. Let $o_t$ be the observed evidence related to $z$, such as a sensory cue or label proxy. Then, the belief state evolves over time via
$$b_{t+1}(z') \propto p(o_t \mid z') \sum_{z \in \mathcal{Z}} p(z' \mid z)\, b_t(z),$$
where $p(o_t \mid z')$ is the observation model and $p(z' \mid z)$ is the latent transition kernel. To improve robustness, predictions from multiple scales are fused. This step is defined by Equation (22) and represented in Figure 3, showing the fusion arrows and the combined output node. These beliefs can be learned jointly or approximated using domain priors. For increased adaptability, we define an entropy-regularized policy over belief states to determine model actions such as update, skip, or reset. Here, $Q$ is a belief-conditioned action-value function, $\mathcal{H}(b)$ is the Shannon entropy of the belief, and $\lambda$ is a regularization constant controlling conservativeness. This mechanism guides the model to balance exploration of uncertain latent states with exploitation of confident predictions. To ensure numerical stability and interpretability, we re-parameterize the prediction using normalized barycentric projections as indicated:
$$\tilde{f}(x) = \frac{f(x)}{\|f(x)\|_2}.$$
This normalization allows fair comparison across different input magnitudes and belief variances. Altogether, the belief-aware conic prediction strategy provides a principled, uncertainty-sensitive approach to structured classification under latent ambiguity, improving reliability and generalization in real-world deployment. The normalization in Equation (24) via barycentric projection serves a dual purpose in ensuring fairness and interpretability. By projecting the raw prediction vector $f(x)$ onto a unit $\ell_2$-norm sphere, we remove the influence of varying input magnitudes and belief-state variances that could otherwise bias the decision boundaries. This normalization guarantees that comparisons between the projected feature vector $\tilde{f}(x)$ and each class anchor $c_k$ are based purely on directional similarity rather than scale, thus promoting fairness across heterogeneous data conditions. Additionally, by constraining predictions to a normalized representation, the resulting decision process becomes more interpretable, and the angular relationships between $\tilde{f}(x)$ and $c_k$ directly encode confidence and class separation in the conic embedding space. This property enhances both the transparency and the robustness of the belief-aware conic prediction strategy in real-world applications.
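The following sketch implements the belief-weighted conic decision of Equation (18) together with the unit-norm projection discussed above; the belief, latent-conditioned outputs, and cone directions are toy placeholders.

```python
import numpy as np

# Sketch of the belief-weighted conic prediction of Equation (18): latent-conditioned
# SCM outputs f(x, z) are averaged under the belief b(z), projected to unit norm, and
# the class whose conic direction c_k has the largest inner product is selected.
# The latent outputs below are toy stand-ins for the SCM's latent-parameterized predictions.
def belief_weighted_predict(belief, latent_outputs, C):
    f_bar = sum(b_z * f_z for b_z, f_z in zip(belief, latent_outputs))  # sum_z b(z) f(x, z)
    f_bar = f_bar / (np.linalg.norm(f_bar) + 1e-12)   # unit-norm (barycentric) projection
    scores = C @ f_bar
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
K, d, num_z = 4, 3, 5
C = rng.normal(size=(K, d)); C /= np.linalg.norm(C, axis=1, keepdims=True)
belief = rng.dirichlet(np.ones(num_z))                       # belief distribution over latent states
latent_outputs = [rng.normal(size=d) for _ in range(num_z)]  # f(x, z) for each latent z
y_hat, scores = belief_weighted_predict(belief, latent_outputs, C)
print("prediction:", y_hat, "scores:", np.round(scores, 3))
```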
The Adaptive Belief Optimization (ABO) framework introduces a belief-aware decision policy designed to manage uncertainty in dynamic state estimation tasks such as object tracking, temporal segmentation, or sequential hypothesis refinement. Unlike conventional policies that operate on observable states, ABO defines a policy $\pi(a \mid b, x)$ over the belief space $\mathcal{B}$—a probability distribution over latent states $z$—combined with the observed input $x$ to choose an action $a \in \mathcal{A}$. The action set typically includes discrete operations such as track (continue with the current belief), update (refine the belief using new evidence), and reinit (reset the belief due to high uncertainty or drift). At each timestep $t$, the policy selects the optimal action based on a learned state-action value function, which is summarized as follows:
$$Q(b_t, x_t, a_t) \leftarrow Q(b_t, x_t, a_t) + \alpha \Bigl[r_t + \gamma \max_{a'} Q(b_{t+1}, x_{t+1}, a') - Q(b_t, x_t, a_t)\Bigr],$$
where $b_t$ is the current belief distribution, $r_t$ is the reward signal, $\gamma$ is a discount factor, and $\alpha$ is the learning rate. The $Q$-function estimates the expected return for taking action $a$ in the current belief–observation pair, and is updated via temporal-difference (TD) learning, allowing the agent to adaptively refine its behavior over time. To prevent the model from overcommitting to uncertain or poorly supported beliefs, ABO incorporates an entropy-based regularization strategy. This encourages exploration and maintains robustness against premature convergence, especially in ambiguous environments or noisy data regimes. The entropy of the belief state is defined as
$$\mathcal{H}(b_t) = -\sum_{z \in \mathcal{Z}} b_t(z)\, \log b_t(z),$$
which quantifies the model’s uncertainty over possible latent states. A high-entropy belief reflects significant ambiguity, whereas low entropy indicates a peaked distribution. Using this metric, the action policy is modified to penalize overconfident decisions in uncertain regions, with a tunable regularization coefficient $\lambda$ determining the weight of the entropy term in the action selection process. This formulation biases the policy toward conservative actions—such as deferring updates or invoking resets—when belief entropy is high, thereby improving stability and reliability. To further refine belief states, ABO integrates a belief transition model $T_{\theta}$ that maps current beliefs and observations into updated distributions. Given an action $a_t$, the transition dynamics can be expressed as
$$b_{t+1} = T_{\theta}\bigl(b_t, x_{t+1}, a_t\bigr),$$
where $\theta$ denotes the model parameters learned via variational inference or supervised likelihood maximization. This enables ABO to simulate the downstream effect of its policy choices on belief trajectories, facilitating more informed decision-making in complex temporal environments.
To evaluate policy performance under uncertainty, we present a risk-adjusted return metric that penalizes low-confidence decisions:
$$J_{\mathrm{risk}} = \mathbb{E}\Bigl[\sum_{t} \gamma^{t} \bigl(r_t - \rho\, \mathcal{H}(b_t)\bigr)\Bigr],$$
where $\rho$ is a risk-aversion hyperparameter, and $\mathcal{H}(b_t)$ represents the belief uncertainty. This metric rewards high-reliability decisions while discouraging hasty updates based on diffuse or misleading evidence. Altogether, ABO provides a principled, uncertainty-aware reinforcement learning framework that integrates belief modeling, entropy regularization, and action scheduling. It is particularly effective in applications where latent state evolution is complex, costly to observe, or sensitive to context—such as in neuro-mechanical control, fossil behavior simulation, and autonomous systems with partial observability. The combination of belief dynamics and policy optimization enables ABO to deliver robust, adaptive performance across diverse decision-making domains.
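To ground the ABO decision loop, the sketch below maintains a tabular Q-function over discretized beliefs, refines it with temporal-difference updates, biases action selection away from non-conservative actions when belief entropy is high, and accumulates the risk-adjusted return. The belief discretization, the reward signal, and the exact form of the entropy penalty on actions are assumptions made for illustration.

```python
import numpy as np

# Illustrative ABO decision loop: a tabular Q over (discretized belief, action) pairs
# is refined by TD updates, belief entropy discourages non-conservative actions, and
# a risk-adjusted return penalizes decisions taken under high uncertainty.
# The discretization, reward, and penalty placement are assumptions for this sketch.
ACTIONS = ["track", "update", "reinit"]
CONSERVATIVE = {"track"}

def entropy(b):
    return float(-np.sum(b * np.log(b + 1e-12)))

def discretize(b, bins=5):
    return tuple(np.minimum((b * bins).astype(int), bins - 1))

def select_action(Q, b, lam=0.5):
    key, H = discretize(b), entropy(b)
    scores = [Q.get((key, a), 0.0) - (0.0 if a in CONSERVATIVE else lam * H) for a in ACTIONS]
    return ACTIONS[int(np.argmax(scores))]

def td_update(Q, b, a, r, b_next, alpha=0.1, gamma=0.9):
    key, key_next = discretize(b), discretize(b_next)
    q_old = Q.get((key, a), 0.0)
    target = r + gamma * max(Q.get((key_next, a2), 0.0) for a2 in ACTIONS)
    Q[(key, a)] = q_old + alpha * (target - q_old)        # temporal-difference update

def risk_adjusted_return(rewards, beliefs, rho=0.2, gamma=0.9):
    return sum((gamma ** t) * (r - rho * entropy(b)) for t, (r, b) in enumerate(zip(rewards, beliefs)))

rng = np.random.default_rng(0)
Q, b = {}, np.array([0.4, 0.3, 0.3])
rewards, beliefs = [], []
for t in range(20):
    a = select_action(Q, b)
    r = 1.0 if a == "update" and entropy(b) < 0.8 else 0.0   # toy reward signal
    b_next = rng.dirichlet(np.ones(3) * 4)                   # toy belief transition
    td_update(Q, b, a, r, b_next)
    rewards.append(r); beliefs.append(b); b = b_next
print("risk-adjusted return:", round(risk_adjusted_return(rewards, beliefs), 3))
```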
In dynamically evolving environments where latent variables govern the observation dynamics, maintaining a coherent internal belief structure is critical for robust decision-making. The belief state $b_t(z)$ over latent variables $z$ is iteratively refined through a Bayesian update mechanism, integrating observations and latent transition dynamics. At each time step $t$, upon receiving observation $o_t$ and taking action $a_t$, the belief update is computed as
$$b_{t+1}(z') = \frac{p(o_t \mid z') \sum_{z \in \mathcal{Z}} p(z' \mid z, a_t)\, b_t(z)}{\sum_{z'' \in \mathcal{Z}} p(o_t \mid z'') \sum_{z \in \mathcal{Z}} p(z'' \mid z, a_t)\, b_t(z)},$$
where the denominator ensures normalization over the latent state space. The transition model $p(z' \mid z, a_t)$ captures how actions modify the hidden dynamics, while the likelihood model $p(o_t \mid z')$ evaluates compatibility with new evidence. To ensure sample efficiency and temporal coherence, belief states are coupled with a structured experience memory $\mathcal{M}$, storing tuples of belief–action–outcome sequences. The predicted map is compared with the ground truth using a composite loss function that integrates region-specific penalties, edge-aware loss, and structural similarity constraints. These are mathematically described in Equation (30) and are shown in Figure 4, Loss Block. The stored samples are periodically drawn to compute a replay loss $\mathcal{L}_{\mathrm{replay}}$ over belief–action–outcome transitions, in which $Q$ is a belief-conditioned action-value function and $\gamma$ is the temporal discount factor. This memory-based update encourages temporal credit assignment while avoiding catastrophic forgetting in non-stationary contexts. To further stabilize belief propagation, we incorporate a belief entropy penalty, represented in Equation (31), that discourages overly sharp distributions in early learning phases:
$$\mathcal{L}_{\mathrm{ent}} = -\mathcal{H}(b_t) = \sum_{z \in \mathcal{Z}} b_t(z)\, \log b_t(z),$$
which acts as a regularizer to maintain sufficient uncertainty when evidence is sparse or conflicting. To capture long-range dependencies and abstract behavioral motifs, we augment the memory $\mathcal{M}$ with a structured graph $\mathcal{G}$, where each node corresponds to a belief embedding and edges encode action-induced transitions. Belief states are embedded via a neural encoder, and we define a contrastive alignment loss $\mathcal{L}_{\mathrm{align}}$ between observed and predicted belief trajectories, which pulls together the embeddings of belief transitions aligned via common actions and pushes apart those that are not. This graph-based regularization enforces geometric smoothness over latent decision paths. Decision-making under structured belief evolution is guided by a dynamic policy that accounts for memory-informed belief embeddings and context vectors, improving adaptability under shifting data distributions or evolving supervision regimes. The joint optimization objective integrates the replay loss, entropy penalty, and graph alignment as outlined:
$$\mathcal{L}_{\mathrm{ABO}} = \mathcal{L}_{\mathrm{replay}} + \lambda_1 \mathcal{L}_{\mathrm{ent}} + \lambda_2 \mathcal{L}_{\mathrm{align}},$$
with hyperparameters $\lambda_1$ and $\lambda_2$ tuned via validation. This structured belief-and-memory system supports continual learning, efficient transfer, and robust generalization in complex, partially observable domains. This process is carried out within the training loop. The final trained model produces the output map
for unseen inputs, as defined in Equation (
34) and shown at the far-right output node of
Figure 4.