Survival Analysis as Imprecise Classification with Trainable Kernels

Konstantinov, Andrei; Utkin, Lev; Efremenko, Vlada; Muliukha, Vladimir; Lukashin, Alexey; Verbova, Natalya

doi:10.3390/math13183040

Open AccessFeature PaperArticle

Survival Analysis as Imprecise Classification with Trainable Kernels

by

Andrei Konstantinov

,

Lev Utkin

,

Vlada Efremenko

,

Vladimir Muliukha

^*

,

Alexey Lukashin

and

Natalya Verbova

Higher School of Artificial Intelligence Technologies, Peter the Great St. Petersburg Polytechnic University, Polytechnicheskaya, 29, 195251 St. Petersburg, Russia

^*

Author to whom correspondence should be addressed.

Mathematics 2025, 13(18), 3040; https://doi.org/10.3390/math13183040

Submission received: 14 August 2025 / Revised: 6 September 2025 / Accepted: 15 September 2025 / Published: 21 September 2025

(This article belongs to the Special Issue Advanced Neural Network and Machine Learning Algorithms, Models and Architectures in Data Mining)

Download

Browse Figures

Versions Notes

Abstract

Survival analysis is a fundamental tool for modeling time-to-event data in healthcare, engineering, and finance, where censored observations pose significant challenges. While traditional methods like the Beran estimator offer nonparametric solutions, they often struggle with the complex data structures and heavy censoring. This paper introduces three novel survival models, iSurvM (imprecise Survival model based on Mean likelihood functions), iSurvQ (imprecise Survival model based on Quantiles of likelihood functions), and iSurvJ (imprecise Survival model based on Joint learning), that combine imprecise probability theory with attention mechanisms to handle censored data without parametric assumptions. The first idea behind the models is to represent censored observations by interval-valued probability distributions for each instance over time intervals between event moments. The second idea is to employ the kernel-based Nadaraya–Watson regression with trainable attention weights for computing the imprecise probability distribution over time intervals for the entire dataset. The third idea is to consider three decision strategies for training, which correspond to the proposed three models. Experiments on synthetic and real datasets demonstrate that the proposed models, especially iSurvJ, consistently outperform the Beran estimator from accuracy and computational complexity points of view. Codes implementing the proposed models are publicly available.

Keywords:

survival analysis; censored data; classification; attention mechanism; imprecise probabilities; Beran estimator

MSC:

68T07; 62N02

1. Introduction

Survival analysis [1], or time-to-event analysis, provides a robust framework for modeling time-to-event data across diverse domains. In healthcare, it predicts critical outcomes such as mortality, disease recurrence, and recovery timelines while quantifying treatment effectiveness and disease progression. In engineering, the method assesses reliability, predicting system failures, optimizing maintenance schedules, and evaluating product durability. It also proves valuable in financial risk modeling (e.g., credit defaults and customer churn) and industrial contexts (e.g., equipment degradation and failure prediction). A key strength of survival analysis is its ability to handle censored data, making it indispensable for scenarios with incomplete event observations, from clinical trials to mechanical system monitoring. This approach delivers quantifiable insights into event timing uncertainty across these fields.

Due to the importance of tasks addressed by survival analysis, numerous machine learning models have been developed to handle time-to-event data and solve related problems within this framework [2,3,4,5,6,7,8]. Comprehensive reviews of these models are available in [9,10,11].

Many of these models adapt and extend traditional classification and regression techniques to handle the unique challenges of survival analysis, for example, random survival forests [12], survival support vector machines [13], neural networks [14], and survival-oriented transformers [15,16,17,18]. These methods demonstrate the growing intersection of classical survival analysis with modern machine learning paradigms.

Survival analysis often involves modeling time-to-event data in the presence of covariates or feature vectors. While traditional nonparametric methods like the Kaplan–Meier estimator [19] assume independence between survival times and covariates, conditional survival models incorporate covariate information to provide more personalized estimates. A prominent example is the Beran estimator [20], a nonparametric conditional survival function estimator that extends the Kaplan–Meier approach by incorporating kernel weighting based on covariates. Kernel-based methods, like the Beran estimator, provide flexible, nonparametric approaches to survival analysis by leveraging smoothing techniques to model time-to-event data with covariates. They offer a powerful toolkit for survival analysis, especially when parametric assumptions fail. Chen [21] presented a neural network framework based on kernels. The deep kernel conditional Kaplan–Meier estimator was introduced in [21]. Sparse kernel methods based on applying SVM were considered in [22]. Instead of estimating the survival function directly, some approaches smooth the cumulative hazard function [23]. A series of survival models based on applying Bayesian kernel methods which combine the flexibility of kernel-based nonparametric modeling with the probabilistic rigor of Bayesian inference [24]. Another kernel-based method is the kernel Cox regression [25,26,27] which relaxes the assumption of the linear covariate effects accepted in the original Cox model [28] by allowing nonlinear effects via kernel functions.

Survival analysis is traditionally framed as a time-to-event prediction problem, but it can also be approached as a classification task by discretizing time and predicting event probabilities over intervals in the framework of discrete-time survival analysis [29], where continuous survival data can be converted into a person–period format [30]. The loss functions in this case are based on applying the likelihood function [31]. A method for treating survival analysis as a classification problem was proposed in [32] where a “stacking” idea was used to collect features and outcomes of the survival data in a large data frame and then to solve the classification problem.

We propose a novel approach within the framework of discrete-time survival analysis, which can be characterized as imprecise survival modeling. The core idea is to model non-parametric probability distributions over time intervals for each instance in the training set. For uncensored observations, the distribution is degenerate, where the event’s true time interval has probability one while all others have probability zero. For censored observations, the probability of the event occurring in any interval after the censored time can range from zero to one, reflecting prior ignorance. This introduces imprecision, meaning each instance induces a set of possible probability distributions. To aggregate the probability distributions generated for each instance from the training set, we propose to employ kernel Nadaraya–Watson regression [33,34] with trainable dot-product attention weights [35,36]. We propose three training strategies, each leading to a distinct model. The first two strategies are based on generating random probability distributions from the produced sets of distributions. The first strategy averages all loss functions in the form of the likelihood function in accordance with the aggregated probability distributions. The corresponding model is called iSurvM (imprecise Survival model based on Mean likelihood functions). The second strategy averages only a portion of the largest values of the loss functions over the generated probability distributions. The corresponding model is called iSurvQ (imprecise Survival model based on Quantiles of likelihood functions). According to the third strategy, the probability distributions and the attention weights are jointly learned in a special way without using the generation procedure. The corresponding model is called iSurvJ (imprecise Survival model based on Joint learning). We avoid any assumptions associated with a survival model, for example, the Cox model, which define the loss functions for training the models (see, for example, [37]).

In contrast to many imprecise models for survival analysis [38,39,40], the proposed models do not consider the incomplete or partial knowledge in event time data. They represent censored observations as interval-valued or imprecise data.

Our contributions can be summarized as follows:

We propose three training strategies and the corresponding survival models, iSurvM, iSurvQ, iSurvJ, as an alternative to kernel-based models such as the Beran estimator.
The models can be trained using different attention mechanisms implemented by means of neural networks as well as simple Gaussian kernels (an additional model iSurvJ(G) which uses a Gaussian kernel for the attention weights). The number of trainable parameters depends on the key and query attention matrices and can be arbitrarily chosen.
Additionally, unlike Beran estimators, the proposed models impose no restrictions on the number of concurrent event times, making them more versatile. Moreover, a high proportion of censored data does not lead to significant accuracy degradation in the proposed models, unlike in Beran estimators.
No parametric assumptions are made in the proposed models.
Various numerical experiments with real and synthetic datasets are conducted to compare the proposed survival models with each other under different conditions and to compare them with the Beran estimator using the concordance index and the Brier score. We compare the proposed models with the Beran estimator, but not with available transformer-based models [16,41,42], because the introduced models are regarded as alternatives to the kernel-based Beran estimator and can be incorporated into a more complex model as a component. The corresponding codes implementing the proposed models are publicly available at: https://github.com/NTAILab/iSurvMQJ (accessed on 14 September 2025).

The paper is organized as follows. Related work devoted to machine learning models in survival analysis, to kernel-based methods and models in survival analysis, and to imprecise models in survival analysis can be found in Section 2. Section 3 provides basic definitions of survival analysis, the attention mechanism, and the Nadaraya–Watson regression. Survival analysis as an imprecise multi-label classification problem is considered in Section 4. Training strategies and survival models iSurvM, iSurvQ, iSurvJ are studied in Section 5. Numerical experiments with well-known public real and synthetic data illustrating the proposed models are given in Section 6. Concluding remarks are provided in Section 7.

2. Related Work

Machine learning models in survival analysis. Survival analysis, which deals with time-to-event data, has seen significant advancements through the integration of machine learning techniques. Numerous models have been developed to predict survival times, hazard functions, and other key measures, leveraging both traditional statistical methods and modern computational approaches. Several comprehensive reviews and surveys provide detailed insights into these developments. For instance, ref. [10] offers a broad examination of machine learning methods in survival analysis, while [9] and [43] discuss recent trends, including deep learning and high-dimensional data applications. Additionally, ref. [44] provides a methodological perspective on generalizing survival models, and [45] explores interpretability and benchmarking in survival machine learning. A more contemporary discussion on emerging techniques and challenges in the field can be found in [11].

Kernel-based methods and models in survival analysis. An important kernel-based approach to survival analysis is the Beran estimator [20], which extends the Kaplan–Meier estimator by incorporating covariate information through kernel smoothing. The Beran estimator or the conditional Kaplan–Meier estimator [46] enables robust estimation of survival functions in complex, high-dimensional settings. Recent advances have further enhanced these techniques through integration with deep learning. For instance, ref. [21] proposed a neural network framework that automatically learned optimal kernel functions for survival analysis, leading to the development of the deep kernel conditional Kaplan–Meier estimator, which improved adaptability to heterogeneous data structures.

Alternative kernel-based strategies include sparse kernel methods via support vector machines (SVMs) [22], which enhance computational efficiency, and approaches that focus on smoothing the cumulative hazard function rather than the survival function directly [23]. Bayesian formulations of kernel methods, such as those in [24], combine the flexibility of nonparametric modeling with the probabilistic rigor of Bayesian inference, offering uncertainty quantification alongside predictive accuracy.

Another significant direction is the kernel Cox regression [25,26,27], which generalizes the classical Cox model [28] by replacing linear covariate effects with nonlinear kernel-based relationships. This relaxation captures intricate dependencies in the data, addressing a key limitation of the original proportional hazards framework. Collectively, these methods demonstrate the versatility of kernel-based techniques in survival analysis, balancing interpretability, computational tractability, and adaptability to diverse data regimes.

Attention mechanism. Recent advances in survival analysis have leveraged attention mechanisms [36] to improve interpretability and predictive accuracy. Deep learning approaches use attention to model complex, nonlinear relationships. The corresponding methods address key challenges like censoring, competing risks, and non-proportional hazards while providing interpretable feature importance. The attention mechanism was applied to different deep learning survival models. In particular, a self-attention mechanism was used to model local and global context simultaneously for survival analysis in [47]. The attention mechanism can be a part of transformers, which capture long-range dependencies in survival data, offering improved performance in high-dimensional settings. Transformers for survival analysis with competing events were proposed in [48]. Transformers solving survival machine learning problems were also proposed in [16,41,42]. Different applications of transformers to medical problems in the framework of survival analysis can be found in [49,50,51,52].

Imprecise models in survival analysis. It should be noted that the imprecise survival analysis extends classical survival models to account for partial or incomplete knowledge in time-to-event data. Unlike traditional methods that assume precise probabilities, imprecise models incorporate set-valued or interval-based estimates to reflect epistemic uncertainty, ambiguity in censoring mechanisms, or conflicting expert opinions. A method of statistical inference based on the Hill model [53] for data that include right-censored observations was proposed in [39], where the authors derive bounds for the predictive survival function. Another method of statistical inference based on the imprecise Dirichlet model [54] was proposed in [38]. A method applying a robust Dirichlet process for estimating survival functions from samples with right-censored data was introduced in [40]. It should be note that the proposed imprecise models cope with cases when exact event times are unknown or are interval-valued and strong distributional assumptions are violated.

3. Preliminaries

3.1. Survival Analysis

Datasets in survival analysis are represented by a set

A = {(x_{1}, δ_{1}, T_{1}), \dots, (x_{N}, δ_{N}, T_{N})}

. Here, the vector

x_{i} \in R^{d}

consisting of d features characterizes the ith object. The time

T_{i}

corresponds to one of two types of event observations. The first type is when the event is observed. In this case, the observation is called uncensored and the censoring indicator

δ_{i}

is equal to one. The second type is when the event is not observed, it is greater than

T_{i}

, but its ‘true’ value is unknown. In this case, the observation is called censored and the censoring indicator

δ_{i}

is equal to zero. A survival model is trained on the set

A

to predict probabilistic measures of an event time T for a new object represented by the vector x.

An important concept in survival analysis is the survival function (SF)

S (t ∣ x)

. It is a function of time t defined as the probability of surviving up to time t, i.e.,

S (t ∣ x) = Pr {T > t ∣ x}

.

Many survival machine learning models have been developed in the last decades. In order to compare the models, special measures are used, different from the standard accuracy measures accepted in machine learning classification and regression models. The most popular measure in survival analysis is Harrell’s C-index (concordance index) [55]. It estimates the probability that, in a randomly selected pair of objects, the object that fails first had a worst predicted outcome. In fact, this is the probability that the event times of a pair of objects are correctly ranked. The C-index does not depend on choosing a fixed time for evaluation of the model and takes into account censoring of patients [56].

Let

J

be a set of all pairs

(i, j)

satisfying conditions

δ_{i} = 1

and

T_{i} < T_{j}

. The C-index is formally computed as proposed in [10,57]:

C = \frac{\sum_{(i, j) \in J} 1 [{\hat{T}}_{i} < {\hat{T}}_{j}]}{\sum_{(i, j) \in J} 1},

(1)

where

{\hat{T}}_{i}

and

{\hat{T}}_{j}

are predicted expected event times.

If the C-index is equal to one, then the corresponding survival model is supposed to be perfect. If the C-index is 0.5, then the model is not better than random guessing.

Another index used to compare survival models is the Brier Score (BS) [58,59] which is defined as

B S (t, \hat{S}) = E [{(Δ_{n e w} (t) - \hat{S} (t ∣ x_{n e w}))}^{2}],

(2)

where

Δ_{n e w} (t) = 1 (T_{n e w} > t)

is the true status of a new test subject, and

\hat{S} (t ∣ x_{n e w})

is the predicted survival probability.

The Integrated Brier Score (IBS) extends the BS by averaging it over a range of time points, typically from

t = 0

to a maximum time

t_{max}

. It is defined as:

I B S = \frac{1}{t_{max}} \int_{0}^{t_{max}} B S (t, \hat{S}) d t .

(3)

3.2. Attention Mechanism and the Nadaraya–Watson Regression

The idea of the attention mechanism can be clearly illustrated using the Nadaraya–Watson kernel regression model [33,34]. Consider a training set

{(x_{1}, y_{1}), \dots, (x_{N}, y_{N})}

with N instances, where each

x_{i} \in R^{d}

is a feature vector and

y_{i} \in R

is its corresponding label. For a new input feature vector

x_{0}

, the regression output prediction

{\tilde{y}}_{0}

can be estimated as a weighted average using the Nadaraya–Watson kernel regression model [33,34]:

{\tilde{y}}_{0} = \sum_{i = 1}^{N} a_{0, i} (w) y_{i} .

(4)

Here,

a_{0, i} (w)

represents the attention weight with trainable parameters

w

, which quantifies the similarity (or distance) between the input feature vector

x_{0}

and the training feature vector

x_{i}

. The closer

x_{0}

is to

x_{i}

, the larger the corresponding weight

a_{0, i} (w)

becomes. In general, any distance or similarity function satisfying this monotonicity condition can serve as attention weights. A natural choice is the family of kernel functions, since a kernel K inherently acts as a similarity measure between vectors

x_{i}

and

x_{0}

. Therefore, the attention weights can be expressed as:

a_{0, i} (w) = \frac{K (x_{0}, x_{i}, w)}{\sum_{j = 1}^{N} K (x_{0}, x_{j}, w)} .

(5)

In the context of the attention mechanism [60], the vector

x_{0}

is referred to as the query, while vectors

x_{i}

and labels

y_{i}

are called the keys and values, respectively. The attention weights

a_{0, i} (w)

can be generalized by introducing trainable parameters. For example, using a Gaussian kernel with trainable parameter vector

w = (w_{1}, \dots, w_{n})

, the attention weight can be expressed as:

a_{0, i} (w) = σ (- {∥x_{0} - x_{i}∥}^{2} ∣ w) = \frac{exp (- w_{i} {∥x_{0} - x_{i}∥}^{2})}{\sum_{j = 1}^{N} exp (- w_{j} {∥x_{0} - x_{j}∥}^{2})},

(6)

where

σ (\cdot)

is the softmax function.

Several definitions of attention weights and corresponding attention mechanisms exist in the literature. Notable examples include the additive attention [60] and multiplicative or dot-product attention [35,36].

4. Survival Analysis as an Imprecise Multi-Label Classification Problem

This section demonstrates how survival models can be formulated using the Nadaraya–Watson kernel regression.

Consider a partition of the time axis into discrete intervals:

[0, \infty) = [0, t_{1}] \cup [t_{1}, t_{2}] \cup \dots \cup [t_{T - 2}, t_{T - 1}] \cup [t_{T - 1}, \infty),

(7)

such that

0 = t_{0} < t_{1} < t_{2} < \dots < t_{T - 1} < \infty .

(8)

The partition divides the time axis into T intervals denoted as

τ_{i} = [t_{i - 1}, t_{i}]

for

i = 1, \dots, T

. We assume that at each time point

t_{i}

(or within interval

τ_{i}

), there are

u_{i}

uncensored events and

c_{i}

right-censored events such that the total number of events is

N = \sum_{i = 1}^{T} (u_{i} + c_{i})

.

Let us define the probability

π_{k}^{(i)}

that an event corresponding to the ith object is observed in the interval

τ_{k}

. This generates a probability distribution for each subject

π^{(i)} = (π_{1}^{(i)}, \dots, π_{T}^{(i)})

,

i = 1, \dots, N

. If the corresponding observation is uncensored, then the following holds:

π_{j}^{(i)} = \{\begin{matrix} 0, & j \neq k, \\ 1, & j = k . \end{matrix}

(9)

For censored observations, the true event time must occur in some interval after

t_{k - 1}

, though the specific interval remains unknown. Consequently, the probability that the event is in each interval after time

t_{k - 1}

can be from zero to one, i.e.,

π_{j}^{(i)} = \{\begin{matrix} 0, & j < k, \\ [0, 1] & j \geq k . \end{matrix}

(10)

The interval

[0, 1]

is a form of prior ignorance. In all cases, we have the constraint

\sum_{j = 1}^{T} π_{j}^{(i)} = 1

for precise probabilities. However, this condition is not fulfilled when the upper or lower probabilities are summed. Due to imprecision of probabilities

π_{j}^{(i)}

, we assume that the ith instance produces a set of probability distributions

π^{(i)}

denoted as

R^{(i)}

.

Let us consider the following classification problem. There is a set of feature vectors

x_{1}, \dots, x_{N}

. The kth interval

[t_{k - 1}, t_{k}]

can be viewed as the kth class represented by the vector

π_{k} = {(π_{k}^{(1)}, \dots, π_{k}^{(N)})}^{T}

. Then, the Nadaraya–Watson kernel regression can be applied to find the class probability distribution

p (x_{0}) = (p_{1} (x_{0}), \dots, p_{T} (x_{0}))

of a new instance

x_{0}

. Since the probabilities

π_{k}^{(i)}

are interval values, then it is obvious that the predicted class probabilities

p_{1}, \dots, p_{T}

are also interval values. They are determined as follows:

p_{k} (x_{0}) = \sum_{i = 1}^{N} a_{0, i} (w) π_{k}^{(i)}, k = 1, \dots, T,

(11)

where

a_{0, i} (w)

is an attention weight expressed through kernels

K (x_{0}, x_{i}, w)

(see (5)),

i = 1, \dots, N

;

w = (w_{1}, \dots, w_{T})

is the trainable vector of parameters.

The attention weights satisfy the condition

\sum_{i = 1}^{N} a_{0, i} (w) = 1 .

(12)

5. Training Strategies and Three Survival Models

5.1. A General Form of Attention Weights

Before presenting the proposed survival models, their common elements in terms of attention weights and regularization are described. We consider the attention weights

a_{0, i} (w)

in a general form. Let

K \in R^{N \times d_{0}}

represent the matrix of the input keys (the matrix of N feature vectors

x_{i}

,

i = 1, \dots, N

, from the training set), and

Q \in R^{N \times d_{0}}

represent the matrix of the input queries (the matrix of the same feature vectors from the training set). Here,

d_{0}

is the dimension of the initial feature vectors

x_{i}

,

i = 1, \dots, N

. The matrix

Q

is used in the training phase. In many tasks, the sparse representation of feature vectors might be useful. In order to implement it, matrices

K

and

Q

can be transformed to sets of vectors of a new dimension

d \geq d_{0}

by using a neural network with parameters

θ

implementing functions

f_{θ} (K)

and

f_{θ} (Q)

as

K = f_{θ} (K_{0}) \in R^{N \times d}, Q = f_{θ} (Q_{0}) \in R^{N \times d},

(13)

where

f_{θ}

is a parameterized neural network trained with weights

θ

.

Let

W_{K} \in R^{d \times d}

and

W_{Q} \in R^{d \times d}

be the trainable matrices. Then, the matrix

A \in R^{N \times N}

of attention weights

a_{i j}

,

i = 1, \dots, N

,

j = 1, \dots, N

, can be computed as [35,36]:

A = \frac{({QW}_{Q}) ({KW}_{K})}{\sqrt{d}} .

(14)

Each element

a_{i j}

of

A

represents the similarity between the ith query and jth key. Matrices

W_{Q}

and

W_{K}

are randomly initialized in the training phase. The attention weights are also represented as a neural network with parameters

W_{Q}

and

W_{K}

.

To restrict certain interactions between elements of matrix

A

, we apply an attention mask

M \in {- \infty, 1}^{N \times N}

which is initialized with uniform random values

m_{i j} \sim U (0, 1)

. Then, it is binarized using a threshold

p_{m a s k}

(a hyperparameter) as follows:

m_{i j} = \{\begin{matrix} 1, & m_{i j} > p_{m a s k}, \\ - \infty, & otherwise, \end{matrix}, m_{i i} = - \infty, i = 1, \dots, N .

(15)

The mask is applied via element-wise multiplication:

A = A ⊙ M

. The row-wise softmax normalization yields the final attention weights:

w_{i, j} = \frac{exp (a_{i, j})}{\sum_{k} exp (a_{i, k})},

(16)

which comprise the final attention weight matrix

W \in R^{N \times N}

.

Hence, (11) can be written in a more general form:

p_{k} (x_{0}) = W π_{k}, k = 1, \dots, T,

(17)

In order to solve the above classification problem with interval-valued class probabilities, we have to define how to train the classifier when the class probabilities are interval-valued.

Let

Π = (π^{(1)}, \dots, π^{(N)})

and

R = R^{(1)} \cup \dots \cup R^{(N)}

. We introduce the loss function for solving the classification problem

L (Θ, Π)

, where

Θ

is a set of training parameters, which includes parameters

W_{K}

and

W_{Q}

from (14),

θ

from (13). One way of dealing with the loss function for a set of probabilities is to consider the following optimization problem:

min_{Θ} E_{Π \in R} L (Θ, Π) .

(18)

To handle the above optimization problem, we apply the Monte Carlo sampling scheme. The expectation is replaced with the sum of objective functions over a set of different probability distributions

π^{(1)}, \dots, π^{(N)}

taken from

R^{(1)}, \dots, R^{(N)}

, respectively. It can be done by generating M discrete probability distributions

q_{1}^{(i)}, \dots, q_{M}^{(i)}

for the ith instance from

R^{(i)}

. Here, M is a hyperparameter. If the ith observation is uncensored (

δ = 1

), then the probability distribution

π^{(i)}

is degenerate, having the unit value at the interval of time where the corresponding event occurred (see (9)). If the ith observation is censored (

δ = 0

), then probabilities are zero before the censoring interval. Probabilities of the subsequent intervals of time are interval-valued (see (10)). The corresponding distributions are randomly generated via the Dirichlet distribution [61]. There are also different approaches to generate points from the unit simplex [62]. As a result, we obtain a set of N matrices

S^{(i)} \in R^{M \times T}

,

i = 1, \dots, N

, such that each matrix contains M probability distributions

q_{1}^{(i)}, \dots, q_{M}^{(i)}

from

R^{(i)}

. It should be noted that we generate only distributions corresponding to interval-valued probabilities

π_{j}^{(i)}

. This implies that the distributions are generated only for censored observations; the unit simplex dimension depends on the time interval when the event occurs.

Let

c (i)

be the index of the time interval in which the event, corresponding to the ith instance, occurs. Then, the unit simplex dimension for generating a distribution is equal to

T - c (i)

.

Hence, the attention-weighted probability matrix for the ith instance is computed as:

P^{(i)} = {WS}^{(i)} \in R^{M \times T} .

(19)

The output

P^{(i)}

provides refined event probabilities across M generations and T intervals for the ith instance, where each element of

P^{(i)}

is computed as a weighted sum of probabilities from

S^{(i)}

:

p_{k}^{(m)} (x_{t}) = \sum_{i = 1}^{N} w_{k, i} S_{m, t}^{(i)} .

(20)

Here,

w_{k, i}

is an element of

W

;

p_{k}^{(m)} (x_{t})

is the kth element of the class probability distribution

p (x_{t})

for the tth instance obtained through the mth generation of the probability distribution from the set

R^{(t)}

;

S_{m, t}^{(i)}

is the element of the matrix

S^{(i)}

corresponding to the mth generation from

R^{(t)}

and the tth interval.

It is important to note that there are no generations for uncensored observations because we have the precise probability distribution (9). Generations are performed only for censored observations. In order to simplify the model description, we assume that M generations are performed for all instances. However, the software implementing the considered models distinguishes these cases.

5.2. First Model

The main idea behind the first proposed model, iSurvM, is based on averaging the loss function for learning the class probability distributions

p^{(m)} (x_{i})

over all generations with indices

m = 1, \dots, M

. The loss function based on using a likelihood function can be written in the following form:

l (p^{(m)} (x_{i})) = - 1 [δ_{i} = 1] \cdot log (p_{c (i)}^{(m)} (x_{i})) - 1 [δ_{i} = 0] \cdot log (\sum_{j = c (i) + 1}^{T} p_{j}^{(m)} (x_{i})),

(21)

where

p_{c (i)}^{(m)} (x_{i})

is the element of the class probability distribution

p^{(m)} (x_{i})

with index

c (i)

.

The first term in the loss function (21) corresponds to uncensored observations, and the second term is responsible for censored observations. The next step is averaging the loss functions

l (p^{(m)} (x_{i}))

over all generations:

L (p (x_{i})) = \sum_{m = 1}^{M} l (p^{(m)} (x_{i})) .

(22)

Finally, we obtain the expected loss function for learning the class probability distribution as follows:

L (p) = \sum_{i = 1}^{N} l (p (x_{i})) .

(23)

Learning the class probability distribution means that we learn attention weights minimizing the loss function

L (p)

because

p (x)

is a function of

π_{k}

(see (17)). After finalizing the attention weights

W

, we additionally learn the final class probability distribution. In this learning process, the probabilities are initialized as averaged generated values, and then they are optimized using loss function (21). This approach allows us to enhance model accuracy.

An example of implementation for training the model iSurvM is presented in Algorithm 1.

Algorithm 1 Training algorithm for iSurvM

Require:: Training set $A$ ; hyperparameters $p_{m a s k}$ , M, T, d; number of epochs E
Ensure:: Weight matrix $W$ ; probability distributions $π_{k}, k = 1, \dots, T$
1:: Initialize randomly matrices $W_{Q}$ , $W_{K}$ and compute the attention matrix $A$ using (14)
2:: Compute the mask matrix $M$ using (15)
3:: Compute the matrix $W$ using (16)
4:: while Number of epochs $\leq E$ do
5:: Generate probability distributions $q_{1}^{(i)}, \dots, q_{M}^{(i)}$ from $R^{(i)}$ for the ith instance using the Dirichlet distribution
6:: Form the matrix $S^{(i)}$ , $i = 1, \dots, N$
7:: Compute the matrix $P^{(i)}$ through $W$ and $S^{(i)}$ using (19), $i = 1, \dots, N$
8:: Learn $W$ using the loss functions (21)–(23)
9:: Compute the matrix $P^{(i)}$ through $W$ using (19)
10:: end while
11:: Fine-tune the averaged values of the probability distributions $q_{1}^{(i)}, \dots, q_{M}^{(i)}$ generated at the last epoch using the matrix $W$ obtained also at the last epoch and using loss function (23)

To find the class probability distribution

(p_{1} (x_{0}), \dots, p_{T} (x_{0}))

for a new instance

x_{0}

(the inference phase) based on the trained neural network with attention parameters

W_{Q}

and

W_{K}

, we use (17) and the trained matrix

W

, but the vectors of probabilities

π_{1}, \dots, π_{T}

in (17) are replaced with vectors

q_{1}^{(i)}, \dots, q_{M}^{(i)}

generated at the last epoch in Algorithm 1.

5.3. Second Model

The main idea behind the second proposed model, iSurvQ, is based on averaging the same loss function (21) for learning the class probability distributions

p^{(m)} (x_{i})

over all instances

i = 1, \dots, N

, i.e.,

L (p^{(m)} (x)) = \sum_{i = 1}^{N} l (p^{(m)} (x_{i})) .

(24)

Then, among all

L (p^{(m)} (x))

,

m = 1, \dots, M

, a portion r of the largest values of the loss functions are selected and averaged over the selected generations. Here, r is a hyperparameter. Let

K

be the set of indices corresponding to the

r \cdot M

largest values of

L (p^{(m)} (x))

. In fact, only the

r \cdot M

worst cases of the loss function are taken into account. This implies that the model is robust.

The final expected loss function for learning the class probability distribution is defined as follows:

L (p) = \sum_{m \in K} l (p^{(m)} (x)) .

(25)

In this model, we again learn the attention weights

W

, and then we learn the final class probability distribution in the same way as described for iSurvM. Therefore, the algorithm implementing the model is the same as Algorithm 1 except for the step when the attention weights

W

are trained. In iSurvQ, the matrix

W

is trained using loss functions (24) and (25) instead of (22) and (23).

5.4. Third Model

The main difference of the third proposed model, iSurvJ, is that the class probabilities and the attention weights are jointly learned. This learning allows us to avoid the generation procedure for the probability distributions

q_{1}^{(i)}, \dots, q_{M}^{(i)}

. To implement the learning, the probability logits are initially initialized as zeros. At each epoch, the

softmax (logits)

operation is applied to the logits to obtain probabilities of intervals

{\tilde{π}}_{k}^{(i)}

and ensuring the condition

\sum_{k = 1}^{T} {\tilde{π}}_{k}^{(i)} = 1

. Here,

{\tilde{π}}_{k}^{(i)}

are precise analogs of interval-valued probabilities

π_{k}^{(i)}

obtained by means of learning.

Let

p_{c (i)} (x_{i})

be the probability of the interval with the index

c (i)

defined using (11), but the probability

π_{k}^{(i)}

in (11) is replaced with the precise probability

{\tilde{π}}_{k}^{(i)}

. The loss function for learning the model is defined in (23). However, the loss function

l (p (x_{i}))

is determined by

\begin{matrix} l (p (x_{i})) & = - 1 [δ_{i} = 1] \cdot log (p_{c (i)} (x_{i})) \\ - 1 [δ_{i} & = 0] \cdot log (\sum_{j = c (i) + 1}^{T} p_{j} (x_{i})) - γ \sum_{k = 1}^{T} {\tilde{π}}_{k}^{(i)} \cdot log ({\tilde{π}}_{k}^{(i)}) . \end{matrix}

(26)

Here,

p (x_{i}) = (p_{1} (x_{i}), \dots, p_{T} (x_{i}))

is the precise probability distribution of the time intervals for the ith instance;

γ

is the hyperparameter. It can be seen from (26) that the regularization term is added to the loss function, which corresponds to the entropy of the probability distribution

{\tilde{π}}^{(i)}

. The same term is added to the loss functions (21) in iSurvM and iSurvQ when the last step in Algorithm 1 is performed (fine-tuning the averaged values of the probability distributions

q_{1}^{(i)}, \dots, q_{M}^{(i)}

). This regularization term prevents the model iSurvJ from making overly optimistic decisions when optimizing both the weights and probabilities at the same time.

An implementation of training the model iSurvJ is presented as Algorithm 2.

Algorithm 2 Training algorithm for iSurvJ

Require:: Training set $A$ ; hyperparameters $p_{m a s k}$ , T, d; number of epochs E
Ensure:: Weight matrix $W$ ; probability distributions ${\tilde{π}}_{k}, k = 1, \dots, T$
1:: Initialize randomly matrices $W_{Q}$ , $W_{K}$ and compute the matrix $A$ using (14)
2:: Compute the mask matrix $M$ using (15)
3:: Compute the matrix $W$ using (16)
4:: Initialize logits of ${\tilde{π}}_{k}^{(i)}$ as zeros
5:: while Number of epochs $\leq E$ do
6:: Compute ${\tilde{π}}_{k}^{(i)}$ using the softmax operation applied to logits y
7:: Compute the probability distribution $p (x_{i})$ using $W$ and ${\tilde{π}}_{k}^{(i)}$
8:: Learn $W$ and ${\tilde{π}}_{k}^{(i)}$ using the loss functions (26) and (23)
9:: Compute logits of ${\tilde{π}}_{k}^{(i)}$
10:: end while

For comparison, we also introduce the iSurvJ(G) model. It is identical to iSurvJ but uses a Gaussian kernel with temperature parameter

τ

to compute the attention weight, instead of the parameterized neural network representation in (13) and (14), which uses parameters

W_{K}

,

W_{Q}

, and

θ

.

For this model, we derive an error bound for the simplest case of a Gaussian kernel without trainable parameters (except for the learned probabilities

{\tilde{π}}^{(i)}

) in the form of Proposition 1. Determining tighter bounds is identified as a direction for future work.

Proposition 1

(iSurvJ(G) error). Model

p (x)

with parameters

{\tilde{π}}^{(i)}

minimizing loss (26) with the smallest

γ > 0

, for which the loss is strictly convex, has the error

∥p (x) - p^{*} (x)∥ = O (\sqrt{NW (N, T) + E_{γ} (N, T)}),

(27)

where

p^{*}

is the true probability distribution,

NW (N, T)

is the Nadaraya–Watson regression bounds for a sample size N with output dimensionality T, and

E_{γ} (N, T)

is the uniform generalization bound for the loss with the highest probability.

The proof of Proposition 1 is presented in Appendix A. Proposition 1 suggests a general framework for determining bounds for the model. Depending on the admissible assumptions, different bounds on

NW (N, T)

and

E γ (N, T)

can be applied. For example, the classical error bound for the Nadaraya–Watson regression with the Gaussian kernel of bandwidth h is of order

NW (N, T) = O (h^{2} + \sqrt{\frac{log (1 / δ)}{N h^{d}}})

[63]. Meanwhile, the pointwise error bound for the leave-one-out log-likelihood estimation of

\tilde{Π}

under simplex constraints is at least of order

E γ (N, T) = O (\sqrt{\frac{log (1 / δ)}{N}})

[64]. Therefore, in this case, the overall error is dominated by

E_{γ} (N, T)

and is largely independent of the kernel properties. Deriving more precise bounds for this error term remains an important objective for future research.

5.5. Extended Intervals

Let us consider a dataset that contains only uncensored instances, where all event times are unique. Recall that the attention mask

M

in (15) is applied to the attention matrix to prevent examples from attending to themselves. For uncensored examples (with

δ = 1

), a constraint is imposed on the probabilities: the probability is set to 1 in the interval containing the actual event time, and 0 in all other intervals. This results in a probability vector of the form

[0, \dots, 1, \dots, 0]

. Since all event times are unique (and hence the probability vectors as well), the only example that could influence the learning of a given example is the example itself. However, self-attention is removed by the mask, and thus the predicted probability of falling into the correct interval, which is computed as a weighted sum of probabilities, becomes 0 (since all the weighted terms are masked out).

To fix this issue, instead of considering only the probability of the event in the exact interval, we also consider the neighboring

2 k

intervals (looking at k intervals on right and k intervals on left) and sum their probabilities. That is, the first term in the loss function, which was

- I [δ_{i} = 1] \cdot log (S (t_{c (i)}) - S (t_{c (i) + 1}) = - I [δ_{i} = 1] \cdot log (p_{c (i)}),

(28)

is replaced with

- I [δ_{i} = 1] \cdot log (S (t_{c (i) - k}) - S (t_{c (i) + 1 + k}) = - I [δ_{i} = 1] \cdot log (\sum_{j = c (i) - k}^{c (i) + 1 + k} p_{j}),

(29)

where k is a model hyperparameter that determines how many intervals around the true event time are considered for uncensored data;

S (t_{c (i)})

is the SF defined through probabilities

p_{j}

.

As a result, a total of

2 k + 1

intervals are used for analyzing each time interval

τ_{i}

.

This modification is applied to all the considered models and ensures that a model considers not only the exact event interval but also adjacent intervals, making training more robust and preventing zero gradients due to masking.

6. Numerical Experiments

The code implementing the proposed models is publicly available at https://github.com/NTAILab/iSurvMQJ (accessed on 14 September 2025).

To rigorously evaluate the performance of the proposed models, we conducted extensive experiments on 11 publicly available survival analysis datasets, including Veterans, AIDS, Breast Cancer, WHAS500, GBSG2, BLCD, LND, GCD, CML, Rossi, and METABRIC.

In addition, we studied the models on synthetic datasets: Friedman1,2,3, Linear, Quadratic, Strong Feature Interactions, Sparse Features, Nonlinear, and Noisy. The number of instances in the synthetic datasets for most experiments was 500 in the training set and 300 in the test set. The noise parameter in the datasets was

ϵ = 5

, except for the noisy data, where

ϵ = 1

. The proportion of censored data in the datasets was

0.2

. Prior to model training, all numerical features were standardized using z-score normalization, while categorical features were encoded using one-hot encoding to ensure compatibility with the machine learning algorithms.

The experimental design employed a nested cross-validation approach to ensure robust performance estimation. The outer loop consisted of four iterations of 5-fold stratified cross-validation, preserving the distribution of censored events in each fold, with shuffling enabled and random seeds fixed for reproducibility. Within each training fold of the outer loop, an additional three-fold stratified cross-validation was performed for hyperparameter optimization.

We compared the performance of five models: the proposed models iSurvM, iSurvQ, iSurvJ, and iSurvJ(G), along with the baseline Beran estimator. Hyperparameter tuning was conducted using the Optuna library [65], which implements Bayesian optimization. Each model was configured with a tailored search space encompassing key parameters such as embedding dimensions (ranging from 64 to 128), learning rates (from

10^{- 4}

to 1), and regularization coefficients for weights (from

10^{- 6}

to

5 \cdot 10^{- 2}

), and the entropy regularization (from

10^{- 6}

to 3). Additional architectural and training parameters included dropout rates (

0.3

to

0.8

), mask rates (

0.1

to

0.5

), batch sizes (10% to 100% of the data), and epoch counts (20 to 2000). The interval width parameter k, which determines the granularity of predictions for uncensored examples, was varied between 3 and 10 across different training instances.

Model performance was assessed using two well-established metrics in survival analysis: the C-index, which measures the ranking consistency of predicted risk scores, and the IBS, which evaluates the accuracy of probabilistic predictions over time.

6.1. Description of Synthetic Data

To evaluate the performance of the survival analysis models, we generated several synthetic datasets with varying levels of complexity, interactions, and noise. These datasets were designed to challenge models in capturing nonlinear dependencies, feature interactions, and the effects of censoring.

The censored data for all synthetic data were generated randomly according to the Bernoulli distribution with some probabilities

Pr {δ_{i} = 0}

and

Pr {δ_{i} = 1}

. Event times were generated in accordance with the Weibull distribution with the shape parameter

k > 0

and had the following form:

T = \frac{y}{Γ (1 + \frac{1}{k})} \cdot {(- log (u))}^{\frac{1}{k}},

(30)

where u is a random number uniformly distributed in the interval

[0, 1]

; y is a value that reflects a specific underlying relationship between features and event time and was computed individually for each dataset as shown below;

Γ (\cdot)

is the gamma function.

The Friedman1 dataset defines the variable y as a nonlinear function of the input features

x = (x_{1}, x_{2}, x_{3}, x_{4}, x_{5}) \in R^{5}

, generated according to the following formula:

y = 10 sin (π x_{1} x_{2}) + 20 {(x_{3} - 0.5)}^{2} + 10 x_{4} + 5 x_{5},

(31)

where covariates

x

are independently drawn from the uniform distribution on the interval

[0, 1]

.

The remaining

d - 5

features (if

d > 5

) are considered as noise variables and do not influence the value of y.

The Friedman2 dataset defines the variable y through a geometric relationship involving an ellipsoidal structure. The response is computed as:

y = \sqrt{x_{1}^{2} + {(x_{2} x_{3} - \frac{1}{x_{2} x_{4}})}^{2}},

(32)

where the input features

x = (x_{1}, x_{2}, x_{3}, x_{4}) \in R^{4}

are independently sampled from the uniform distribution on the interval

[0, 1]

. The features are then scaled and shifted as follows to introduce variability and align with the desired characteristics of the dataset:

x_{1}

is scaled by a factor 100;

x_{2}

is scaled by

520 π

and shifted by

40 π

;

x_{4}

is scaled by 10 and then shifted by 1.

That function introduces strong nonlinear interactions and is useful for evaluating the performance of algorithms in capturing complex geometric dependencies between features.

The Friedman3 dataset defines the variable y as a highly nonlinear function involving division, multiplicative interactions, and a trigonometric transformation. Specifically, y is computed as follows:

y = arctan (\frac{x_{2} x_{3} - 1 / (x_{2} x_{4})}{x_{1}}) .

(33)

As with the previous dataset, the features

x = (x_{1}, x_{2}, x_{3}, x_{4}) \in R^{4}

are generated by the same way.

This function introduces complex interactions between variables and a strong nonlinear transformation via the arctangent function, making it a challenging benchmark for models.

To evaluate the models’ capability to handle strong feature interactions, we generated a dataset where the variable y was determined as a sum of the pairwise feature interactions:

y = \sum_{i = 1}^{d} \sum_{j = i + 1}^{d} w_{i j} x_{i} x_{j},

(34)

where

x = (x_{1}, x_{2}, \dots, x_{d}) \in R^{d}

is a feature vector, with each component independently drawn from the uniform distribution;

W = {w_{i j}} \in R^{d \times d}

,

i, j = 1, \dots, d

, is a symmetric matrix of interaction coefficients with a zero diagonal, i.e.,

w_{i j} = w_{j i}

,

w_{i i} = 0,

and for

i < j

, the upper-triangular elements are independently sampled from a uniform distribution:

w_{i j} \sim U (0, 1)

.

This formulation ensures that the target variable y is entirely governed by second-order interactions.

The Sparse feature dataset represents a scenario with high-dimensional sparse data. Features are generated as a sparse matrix

X \in R^{n \times d}

, where n is the number of samples, d is the number of features, the entries of X are defined such that only some proportion

s \in (0, 1)

of them are non-zero, with non-zero values sampled from a uniform distribution:

x_{i j} \sim \{\begin{matrix} U (0, 1), & with probability s, \\ 0, & with probability 1 - s . \end{matrix}

(35)

The variable

y \in R^{n}

is modeled as a linear function of the features:

y = X w,

(36)

where

w \in R^{d}

is a coefficient vector. The elements of

w

are generated using

w_{j} \sim U (0, 1)

. This dataset design challenges models to handle sparsity in the feature space, as most inputs contain limited information.

The Nonlinear dependency dataset is designed to model complex, highly nonlinear relationships between input features and the target variable. The variable y was defined as:

y = 4 sin (x_{1}) + log (| x_{2} | + 1) + x_{3}^{2} + e^{0.5 x_{4}} + tanh (x_{5}) + \sum_{j = 6}^{d} f_{j} (x_{j}),

(37)

where

x = (x_{1}, x_{2}, \dots, x_{d}) \in R^{d}

is a feature vector with

d \geq 5

; each component

x_{j} \sim U (0, 1)

is sampled independently;

f_{j} (x_{j})

are additional nonlinear functions applied to the remaining features

j = 6, \dots, d

, which may be chosen from a predefined set of nonlinear transformations (e.g.,

sin (x_{j}) \cdot \sqrt{| x_{j - 1} | + 1}

,

log (| x_{j} | + 1) \cdot tanh (x_{j - 2})

,

x_{j}^{2} \cdot cos (x_{j - 3})

) to increase the overall complexity of the target variable.

This construction creates a highly nonlinear mapping from features to the target, requiring models to capture a variety of nonlinear behaviors including oscillatory, exponential, polynomial, and saturation effects.

In the Noisy dataset, the target variable T is generated by applying a transformation

y = X w

using the Weibull distribution with a small shape parameter

k \leq 1

, leading to high-variance noise. Here,

X \in R^{n \times d}

is a feature matrix with entries

x_{i j} \sim U (0, 1)

; and

w \in R^{d}

is a coefficient vector. Elements of

w

are generated using

w_{j} \sim U (0, 1)

.

By setting

k \leq 1

, the resulting distribution of T becomes heavy-tailed and highly variable, which significantly obscures the underlying linear relationship encoded by y, thus testing a model’s robustness to noise.

The Linear dataset is designed to model simple linear dependencies between features and the target variable. The variable y is generated as follows:

y = \sum_{i = 1}^{d} w_{i} x_{i} = x^{⊤} w,

(38)

where

x = {(x_{1}, x_{2}, \dots, x_{d})}^{⊤} \in R^{d}

is a feature vector with components sampled independently from the uniform distribution

x_{i} \sim U (0, 1)

;

w = {(w_{1}, w_{2}, \dots, w_{d})}^{⊤} \in R^{d}

is a vector of weights with coefficients also sampled independently from a uniform distribution

w_{i} \sim U (0, 1)

.

This setup ensures that the target variable y is a pure linear function of the input features, serving as a baseline for evaluating a model’s ability to learn linear relationships.

The Quadratic dataset is designed to capture second-order (quadratic) dependencies between features. The variable is constructed as a quadratic form of the input features:

y = x^{⊤} Q x,

(39)

where

x = {(x_{1}, x_{2}, \dots, x_{d})}^{⊤} \in R^{d}

is a feature vector with components independently sampled from the uniform distribution,

x_{i} \sim U (0, 1)

;

Q \in R^{d \times d}

is a symmetric positive semi-definite matrix that defines the curvature of the quadratic form

Q = A^{⊤} A

;

A \in R^{d \times d}

is a matrix with entries sampled from the standard normal distribution

A_{i j} \sim N (0, 1)

.

This formulation ensures that Q is symmetric and positive semi-definite, thus guaranteeing

y \geq 0

for all

x

, and that the target variable captures interactions between features in a smooth, convex manner.

6.2. Description of Real Data

The Veterans’ Administration Lung Cancer Study (Veteran) Dataset includes data on 137 patients characterized by six features. It is available through the R package “survival”, version 3.8-3.

The AIDS Clinical Trials Group Study (AIDS) Dataset contains healthcare and categorical data for 2139 patients, all diagnosed with AIDS, and are described by 23 features. It can be accessed on Kaggle: https://www.kaggle.com/datasets/tanshihjen/aids-clinical-trials (accessed on 14 September 2025).

The Breast Cancer Dataset consists of 198 samples described by 80 features. The endpoint is the occurrence of distant metastases, which are observed in 51 patients (25.8%). It can be obtained from the Python library “scikit-survival”version 0.25.0.

The Worcester Heart Attack Study (WHAS500) Dataset considers 500 patients with 14 features. The dataset can be obtained via the “smoothHR” R package version 1.0.5 or the Python “scikit-survival” package version 0.25.0.

The German Breast Cancer Study Group 2 (GBSG2) Dataset includes observations from 686 patients with 10 features. It is accessible via the “TH.data” R package version 1.1-4.

The Bladder Cancer Dataset (BLCD) includes observations of 86 patients after surgery assigned to placebo or chemotherapy. It has two features and can be obtained from https://www.stat.rice.edu/~sneeley/STAT553/Datasets/survivaldata.txt (accessed on 14 September 2025).

The Lupus Nephritis Dataset (LND) consists of observations of 87 persons with lupus nephritis, followed for over 15 years after an initial renal biopsy. This dataset only contains time to death/censoring, indicator, duration, and log(1 + duration), where duration is the duration of untreated disease prior to biopsy. The dataset is available at http://www.stat.rice.edu/~sneeley/STAT553/Datasets (accessed on 14 September 2025).

The Gastric Cancer Dataset (GCD) includes observations of 90 patients with four features. It is available through the R package “coxphw”version 4.0.3.

The Chronic Myelogenous Leukemia Survival (CML) Dataset is simulated according to the structure of data by the German CML Study Group. The dataset consists of 507 observations with seven features: a factor with 54 levels indicating the study center; a factor with levels trt1, trt2, trt3 indicating the treatment group; sex (0 = female, 1 = male); age in years; risk group (0 = low, 1 = medium, 2 = high); censoring status (FALSE = censored, TRUE = dead); and time survival or censoring time in days. The dataset can be obtained via the “multcomp” R package (cml) version 1.4-28.

The Rossi Dataset contains 432 convicts released from Maryland state prisons in the 1970s, described by 62 features. The dataset can be obtained via the “RcmdrPlugin.survival” R package version 1.3-2.

The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) Database contains records of 1904 breast cancer patients, including nine gene indicators and five clinical features. The dataset is available at https://www.kaggle.com/datasets (accessed on 14 September 2025).

6.3. Study of the Model Properties

Table 1 shows the C-indices obtained for real datasets by using the Beran estimator, iSurvM, iSurvQ, iSurvJ, and iSurvJ(G). It can be seen from the presented results that the proposed models outperformed the Beran estimator for almost all datasets. The results imply that iSurvJ and iSurvJ(G) provided better results in comparison with other models. Moreover, the worst results were observed for iSurvM. The same conclusion can be made for real datasets based on the IBS as shown in Table 2.

Numerical results for real data showed that the best models were iSurvJ and iSurvJ(G). Therefore, we studied the properties of these models on the basis of synthetic data.

Based on the results in Table 1 and Table 2, we formally demonstrate the superiority of iSurvJ over the Beran model using the Wilcoxon signed-rank test. This test, which ranks performance differences between two models, was advocated by Demsar [66] for determining statistical significance in classifier comparisons. The test returned a p-value of

0.032

for the C-index and

0.042

for the IBS for this model pair. Since both p-values are below the

0.05

significance level, we conclude that iSurvJ’s superiority is statistically significant. While other models also generally performed better than the Beran estimator, their improvements were not statistically significant.

6.3.1. Dependence on the Number of Features

Using synthetic datasets, we studied how the number of features impacted the accuracy measures. We applied the models to the Linear, Quadratic, Strong Feature Interactions, Sparse Features, Nonlinear, Noisy datasets. The number of features ranged from 1 to 10 in increments of 1. The features were standardized. The parameters of iSurvJ were as follows: the number of epochs was 300, the learning rate was

10^{- 2}

,

γ = 0.1

, the mask rate was

0.5

, the regularization coefficient for weights was

2 \cdot 10^{- 3}

, the embedding dimension was 64, the batch_rate was

0.2

, and the dropout rate was

0.5

. Parameters of the Beran estimator: a Gaussian kernel with temperature parameter

τ = 0.1

. For each number of features in one dataset, 50 experiments were conducted.

Figure 1 and Figure 2 show the dependence of the C-index and IBS on the number of features. It can be seen from Figure 1 and Figure 2 that iSurvJ demonstrated better results compared to the Beran estimator on all considered synthetic datasets. Moreover, the greater the number of features in the instances, the more significant the difference between the proposed model and the Beran estimator. This implies that the proposed model is recommended for datasets with high-dimensional feature vectors.

6.3.2. Dependence on Parameter k

The next question was how parameter k in (29) impacted the model accuracy. We used the Friedman1, Friedman2, and Friedman3 synthetic datasets. The parameters of the model iSurvJ were the same as in the previous experiments. The number of features was five. The proportion of censored data in the dataset was

0.2

. Values of the parameter k varied from 0 to 20 in increments of 1.

Results are shown in Figure 3. It can be seen from Figure 3 that the model’s accuracy significantly improves as the parameter k increases, but beyond a certain threshold, the improvement plateaus. At the same time, increasing k also raises the computational complexity. Therefore, beyond a certain value, further increasing k becomes pointless.

6.3.3. Dependence on the Number of Censored Observations

Another study question was how the model accuracy depended on the number of censored instances. We used the Strong Feature Interactions and Sparse Features synthetic datasets with five features. The proportion of censored data in each dataset varied from 0 to

0.8

in increments of

0.1

. The parameters of iSurvJ were the same as in the previous experiments.

Figure 4 shows the dependence of the C-index and IBS on the proportion of censored data. The experimental results primarily illustrate how the accuracy of the proposed model decreased as the proportion of censored data increased.

The comparative experiments are illustrated in Figure 5, Figure 6 and Figure 7, which show the proposed model compared to the Beran estimator as the proportion of censored data increases. The average Kolmogorov–Smirnov (KS) distances between SFs obtained by these models for the same instances are also shown, depending on the proportion of censored data. The comparison results are provided for the Linear, Quadratic, Strong Feature Interactions, Friedman1, Friedman2, Friedman3, Sparse Features, Nonlinear, and Noisy datasets. The key feature of these comparative experiments was that instead of the iSurvJ model, we used the iSurvJ(G) model, where the neural network attention mechanism was replaced with a Gaussian kernel featuring a learnable parameter. It is interesting to point out from Figure 5, Figure 6 and Figure 7 that iSurvJ(G) outperformed the Beran estimator especially when the number of censored instances was large. This is clearly demonstrated by the KS distances, which show how the difference between SFs of two models increases with the proportion of censored data.

6.3.4. Intervals for Survival Functions

The next experiments illustrated intervals of SFs depicted in Figure 8 and Figure 9. The plots are shown for one instance from a dataset (similar patterns can be observed for other instances and other datasets). One SF was produced as the prediction for this instance by the iSurvJ(G) model. Another SF was obtained as a prediction using the Beran estimator.

The bounds for the SF were obtained using (11), where probabilities

π_{k}^{(i)}

,

k = 1, \dots, T

are interval-valued for censored observations. The attention weights

a_{j, i} (w)

were computed by using Gaussian kernels. As a result, we obtained the interval-valued probability distribution

p_{k} (x_{j})

,

k = 1, \dots, T

, for the jth instance, which produced the interval-valued SFs, depicted in Figure 8 and Figure 9 for instances from the Friedman1 and Nonlinear datasets, respectively.

It is important to note that the Beran estimator predicted an SF which is totally inside the lower and upper SF bounds. This is a very interesting observation.

6.3.5. Unconditional Survival Functions

Let us study the relationship between the unconditional SFs produced by iSurvJ and the Kaplan–Meier estimator. The unconditional SF was calculated by averaging all conditional SFs obtained for all instances from a dataset. We considered the following real datasets: Veterans Lung Cancer, GBSG2, WHAS500, and Breast Cancer. Numerical features were standardized; categorical features were encoded. the parameters of iSurvJ were as follows: the number of epochs was 1000, the learning rate was

10^{- 2}

,

γ = 0.1

, the mask rate was 0, the regularization coefficient for weights was

2 \cdot 10^{- 3}

, the embedding dimension was 64, the batch_rate was

0.1

, and the dropout rate was

0.6

.

Figure 10 and Figure 11 illustrate the unconditional SFs obtained by iSurvJ and the Kaplan–Meier estimator. One can see from the plots that the SFs of both models are very close to each other.

6.3.6. An Illustrative Comparative Example of iSurvJ(G) and Beran Estimator

To demonstrate the quality of the proposed approach, a synthetic dataset was generated in which the true relationship between a feature and the time to event was given by the following parabolic function:

T = {(x - x_{0})}^{2},

where

x_{0} = 0

corresponds to the position of the minimum.

Features

x \in R

were generated to ensure a non-uniform observation density in different intervals: 200 points uniformly distributed in the interval

[- 5, - 2]

, 10 points located in the interval

[- 2, 2]

, and 200 points uniformly distributed in the interval

[2, 5]

. Thus, the total number of observations was

N = 410

, with the central region corresponding to the smallest time-to-event values artificially thinned to reduce event density.

For the left and right parts of the sample, the censoring indicator

δ \in {0, 1}^{n}

was determined randomly according to the Bernoulli scheme with probability

p_{cens} = 0.1

. For the central part of the parabola containing 10 observations, the censoring indicators were set manually, alternately taking values zero and one, which ensured diversity of observations in the region of the minimum. The iSurvJ(G) model was used in the experiment with 50 epochs, a learning rate of 7,

γ = 0.2

,

k = 5

, a batch_rate of 1. In the Beran estimator, the same kernel as in iSurvJ(G) was used.

Results of the experiment are shown in Figure 12, which clearly demonstrates that the Beran estimator exhibits unstable behavior and fails to learn effectively from the given data, while the proposed model iSurvJ(G) with the same kernel successfully handles the complex data structure.

6.4. Some Conclusions from Numerical Experiments

The proposed models consistently outperformed the Beran estimator across synthetic datasets (Linear, Quadratic, Strong Feature Interactions, etc.), with the performance gap widening as the number of features increased (see Figure 1 and Figure 2). This suggests that iSurvJ is better suited for datasets with complex, high-dimensional feature structures. As the proportion of censored data increased, the Beran estimator exhibited instability, while iSurvJ(G) maintained robust performance (see Figure 5, Figure 6 and Figure 7). The Kolmogorov–Smirnov (KS) distances between SFs further confirmed that iSurvJ(G)’s predictions diverged significantly from the Beran estimator under heavy censoring, highlighting its adaptability.

The interval-valued SFs generated by the interval-valued representation of the instance probability distributions over the time intervals (see Figure 8 and Figure 9) encapsulated the Beran estimator’s predictions within the bounds, suggesting that the proposed model provides a more comprehensive representation of uncertainty, especially for censored observations.

The parameter k, controlling the neighborhood of intervals for uncensored data, initially improved accuracy but plateaued beyond a threshold (see Figure 3). This indicates a trade-off between the computational cost and marginal gains in predictive performance.

The unconditional SFs produced by iSurvJ closely matched those of the nonparametric Kaplan–Meier estimator (see Figure 10 and Figure 11), validating its consistency with classical methods while offering conditional predictions.

To compare the performance of iSurvJ(G) and the Beran estimator, we presented an illustrative example using a synthetic dataset generated with a parabolic time-to-event function (see Figure 12). The example demonstrated unstable behavior of the Beran estimator trained on specific complex data, whereas iSurvJ(G) with the same kernel successfully captured the underlying data structure.

In a nutshell, the experimental results demonstrated the superior performance of the proposed iSurvJ and iSurvJ(G) models compared to the traditional Beran estimator, particularly in high-dimensional and censored-data scenarios.

It should be noted that the models’ performance relies on hyperparameter tuning (e.g., embedding dimensions, regularization coefficients), though Optuna-based optimization mitigated this. While iSurvJ(G) uses a simpler Gaussian kernel, the neural network-based iSurvJ may face scalability issues with large datasets.

The main challenge with comparing our model to available transformer-based or deep learning survival models is that they require large volumes of data. For example, the authors of DeepHit [67] trained their model on the SEER dataset, which consists of 72,809 patients, or on a SYNTHETIC dataset of 30,000 patients. It is difficult to train deep learning survival models on small datasets due to the risk of overfitting. However, we compared our results with those of different models trained on the METABRIC dataset. The best corresponding C-index results, as presented in [16], were as follows: DeepSurv [68] 0.645; DeepHit 0.636; transformer-based deep survival model [16] 0.640. Models iSurvJ 0.647; iSurvJ(G) 0.6407. These results show that iSurvJ achieved the best performance, while the iSurvJ(G) variant remained competitive with the other models.

As a computational complexity example, we compared models iSurvJ and iSurvJ(G) trained on the synthetic Quadratic dataset. For a task with 500 instances and five features using 200 epochs, the learning time for iSurvJ(G) was 505 ms, compared to 808 ms for iSurvJ. The neural network implementing the function

f_{θ}

(see (13)) produced an embedding of size 64 for both models. The results showed a significant difference in learning time. However, their inference times were found to be approximately equal.

7. Conclusions

This work has introduced three novel survival models: iSurvM, iSurvQ, and iSurvJ, leveraging the imprecise probability theory and attention mechanisms to address censored-data challenges. One of the contributions is flexible modeling which means that the proposed framework avoids parametric assumptions and accommodates interval-valued probabilities for censored instances, enabling richer uncertainty quantification. We also have to point out that iSurvJ and its Gaussian-kernel variant iSurvJ(G) consistently outperformed the Beran estimator, particularly in high-dimensional settings and under heavy censoring.

The attention weights can be interpreted as a quantitative contribution measure within the example-based explanation framework [69]. In this approach, the most similar instances to a given sample are retrieved from the dataset to explain the model’s output—in this case, the survival function. This provides an advanced method for identifying influential instances, directly answering the question: “Which training instances were most responsible for this specific prediction?”

It should be noted that the proposed models deal with the standard survival analysis task. At the same time, it is interesting to extend it to competing risks and time-varying covariates. These are directions for further research. Moreover, we considered only small and middle-dimensional data. Another important direction for further research is to investigate scalability enhancements for ultra-high-dimensional data.

An extreme case of imprecision is when probabilities of time-to-event for the censored observations are in intervals from zero to one. There exist imprecise models and approaches which allow us to reduce the intervals employing additional assumptions, for example, applying definitions of reachable probability intervals [70] or survival models based on imprecise Dirichlet distribution [38].

Author Contributions

Conceptualization, L.U., A.K. and V.M.; methodology, L.U., V.M. and N.V.; software, A.K., V.E. and A.L.; validation, A.K., V.E., N.V., V.M. and A.L.; formal analysis, A.K. and L.U.; investigation, L.U., A.K., N.V. and V.M.; resources, L.U. and V.M.; data curation, V.M., A.L. and V.E.; writing—original draft preparation, L.U., N.V. and A.K.; writing—review and editing, L.U., N.V. and V.M.; visualization, V.E. and A.K.; supervision, L.U.; project administration, V.M.; funding acquisition, L.U. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Russian Science Foundation under grant 25-11-00021.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in several public repositories. Data were derived from the resources available in the public domain and described in Section 6.2. [AIDS] [https://www.kaggle.com/datasets/tanshihjen/aids-clinical-trials] (accessed on 14 September 2025).

Acknowledgments

The authors would like to express their appreciation to the anonymous referees whose very valuable comments have improved the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Proof of Proposition 1.

First, decompose the norm:

∥p (x_{0}) - p^{*} (x_{0})∥ \leq ∥\sum_{i = 1}^{N} α_{0, i} {\tilde{π}}^{(i)} - \sum_{i = 1}^{N} α_{0, i} p^{*} (x_{i})∥ + ∥\sum_{i = 1}^{N} α_{0, i} p^{*} (x_{i}) - p^{*} (x_{0})∥ .

(A1)

The second term is

O (NW (N, T))

as it is a classical Nadaraya–Watson’s error.

The first term can be bounded as follows:

∥\sum_{i = 1}^{N} α_{0, i} {\tilde{π}}^{(i)} - \sum_{i = 1}^{N} α_{0, i} p^{*} (x_{i})∥ \leq \sum_{i = 1}^{N} α_{0, i} ∥{\tilde{π}}^{(i)} - p^{*} (x_{i})∥ \leq max_{i \in {1, \dots, N}} ∥{\tilde{π}}^{(i)} - p^{*} (x_{i})∥ .

(A2)

It remains to derive the error bound on

∥{\tilde{π}}^{(i)} - p^{*} (x_{i})∥

.

We find the vector

{\tilde{π}}^{(i)}

as a minimizer of the loss function (26), which is a negative log-likelihood calculated for the kernel regression predictions. It can be seen that since

p (x_{i})

estimates the vector

{\tilde{π}}^{(i)}

as a Nadaraya–Watson regressor, then

∥ {\tilde{π}}^{(i)} - p (x_{i}) ∥ = O (NW (N, T))

, the loss is Lipschitz continuous (given sufficient

γ > 0

), and the following holds

l ({\tilde{π}}^{(i)}) = l (p (x_{i}) + {\tilde{π}}^{(i)} - p (x_{i})) = l (p (x_{i})) + O (NW (N, T)) .

(A3)

Then, taking into account that

p (x_{1}), \dots, p (x_{N})

minimize the empirical loss function,

p^{*} (x)

minimizes the expected loss function, and the generalization bound assumption holds

l (p (x_{i})) - l (p^{*} (x_{i})) = O (E_{γ} (N, T)),

(A4)

therefore, for the expected loss functions, we obtain

l ({\tilde{π}}^{(i)}) - l (p^{*} (x_{i})) = O (NW (N, T) + E_{γ} (N, T)) .

(A5)

By using the Pinsker’s inequality, we write

{∥{\tilde{π}}^{(i)} - p^{*} (x_{i})∥}^{2} \leq 2 |l ({\tilde{π}}^{(i)}) - l (p^{*} (x_{i}))| .

(A6)

Combining the above, we obtain

∥{\tilde{π}}^{(i)} - p^{*} (x_{i})∥ = O (\sqrt{NW (N, T) + E_{γ} (N, T)}) .

(A7)

Substituting this back into (A2) establishes the claimed rate for

∥p (x_{0}) - p^{*} (x_{0})∥

, since

O (\sqrt{NW (N, T)})

dominates

O (NW (N, T))

. □

References

Hosmer, D.; Lemeshow, S.; May, S. Applied Survival Analysis: Regression Modeling of Time to Event Data; John Wiley & Sons: Hoboken, NJ, USA, 2008. [Google Scholar]
Jing, B.; Zhang, T.; Wang, Z.; Jin, Y.; Liu, K.; Qiu, W.; Ke, L.; Sun, Y.; He, C.; Hou, D.; et al. A deep survival analysis method based on ranking. Artif. Intell. Med. 2019, 98, 1–9. [Google Scholar] [CrossRef] [PubMed]
Hao, L.; Kim, J.; Kwon, S.; Ha, I.D. Deep learning-based survival analysis for high-dimensional survival data. Mathematics 2021, 9, 1244. [Google Scholar] [CrossRef]
Lee, E.; Wang, J. Statistical Methods for Survival Data Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2003. [Google Scholar]
Hothorn, T.; Bühlmann, P.; Dudoit, S.; Molinaro, A.; van der Laan, M. Survival ensembles. Biostatistics 2006, 7, 355–373. [Google Scholar] [CrossRef]
Wrobel, L.; Gudys, A.; Sikora, M. Learning rule sets from survival data. BMC Bioinform. 2017, 18, 285–297. [Google Scholar] [CrossRef] [PubMed]
Zhao, L.; Feng, D. DNNSurv: Deep Neural Networks for Survival Analysis Using Pseudo Values. arXiv 2019, arXiv:1908.02337v2. [Google Scholar]
Zhao, Z.; Zobolas, J.; Zucknick, M.; Aittokallio, T. Tutorial on survival modeling with applications to omics data. Bioinformatics 2024, 40, btae132. [Google Scholar] [CrossRef]
Marinos, G.; Kyriazis, D. A Survey of Survival Analysis Techniques. In Proceedings of the HEALTHINF, Online, 11–13 February 2021; pp. 716–723. [Google Scholar] [CrossRef]
Wang, P.; Li, Y.; Reddy, C. Machine Learning for Survival Analysis: A Survey. ACM Comput. Surv. (CSUR) 2019, 51, 1–36. [Google Scholar] [CrossRef]
Wiegrebe, S.; Kopper, P.; Sonabend, R.; Bischl, B.; Bender, A. Deep learning for survival analysis: A review. Artif. Intell. Rev. 2024, 57, 1–34. [Google Scholar] [CrossRef]
Ishwaran, H.; Kogalur, U. Random Survival Forests for R. R News 2007, 7, 25–31. [Google Scholar]
Belle, V.V.; Pelckmans, K.; Suykens, J.; Huffel, S.V. Survival SVM: A practical scalable algorithm. In Proceedings of the ESANN, Bruges, Belgium, 23–25 April 2008; pp. 89–94. [Google Scholar]
Chen, G.H. An Introduction to Deep Survival Analysis Models for Predicting Time-to-Event Outcomes. Found. Trends Mach. Learn. 2024, 17, 921–1100. [Google Scholar] [CrossRef]
Arroyo, A.; Cartea, A.; Moreno-Pino, F.; Zohren, S. Deep attentive survival analysis in limit order books: Estimating fill probabilities with convolutional-transformers. Quant. Financ. 2024, 24, 35–57. [Google Scholar] [CrossRef]
Hu, S.; Fridgeirsson, E.; van Wingen, G.; Welling, M. Transformer-based deep survival analysis. In Proceedings of the Survival Prediction-Algorithms, Challenges and Applications, PMLR, Palo Alto, CA, USA, 22–24 March 2021; pp. 132–148. [Google Scholar]
Li, C.; Zhu, X.; Yao, J.; Huang, J. Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics. In Proceedings of the 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; IEEE Computer Society: Los Alamitos, CA, USA, 2022; pp. 4256–4262. [Google Scholar] [CrossRef]
Zhang, X.; Mehta, D.; Hu, Y.; Zhu, C.; Darby, D.; Yu, Z.; Merlo, D.; Gresle, M.; Van Der Walt, A.; Butzkueven, H.; et al. Adaptive transformer modelling of density function for nonparametric survival analysis. Mach. Learn. 2025, 114, 31. [Google Scholar] [CrossRef]
Kaplan, E.; Meier, P. Nonparametric Estimation from Incomplete Observations. J. Am. Stat. Assoc. 1958, 53, 457–481. [Google Scholar] [CrossRef]
Beran, R. Nonparametric regression with randomly censored survival data. In Technical Report; University of California: Berkeley, CA, USA, 1981. [Google Scholar]
Chen, G.H. Survival kernets: Scalable and interpretable deep kernel survival analysis with an accuracy guarantee. J. Mach. Learn. Res. 2024, 25, 1–78. [Google Scholar]
Evers, L.; Messow, C.M. Sparse kernel methods for high-dimensional survival data. Bioinformatics 2008, 24, 1632–1638. [Google Scholar] [CrossRef] [PubMed]
Gefeller, O.; Michels, P. A review on smoothing methods for the estimation of the hazard rate based on kernel functions. In Proceedings of the Computational Statistics: Volume 1: Proceedings of the 10th Symposium on Computational Statistics, Neuchatel, Switzerland, August 1992; Springer: Berlin/Heidelberg, Germany, 1992; pp. 459–464. [Google Scholar] [CrossRef]
Cawley, G.; Talbot, N.; Janacek, G.; Peck, M. Bayesian kernel learning methods for parametric accelerated life survival analysis. In Proceedings of the First International Conference on Deterministic and Statistical Methods in Machine Learning, Sheffield, UK, 7–10 September 2004; pp. 37–55. [Google Scholar] [CrossRef]
Li, H.; Luan, Y. Kernel Cox regression models for linking gene expression profiles to censored survival data. In Proceedings of the Pacific Symposium Biocomputing 2003; World Scientific: Singapore, 2002; pp. 65–76. [Google Scholar] [CrossRef]
Rong, Y.; Zhao, S.D.; Zheng, X.; Li, Y. Kernel Cox partially linear regression: Building predictive models for cancer patients’ survival. Stat. Med. 2024, 43, 1–15. [Google Scholar] [CrossRef] [PubMed]
Yang, H.; Zhu, H.; Ahn, M.; Ibrahim, J.G. Weighted functional linear Cox regression model. Stat. Methods Med. Res. 2021, 30, 1917–1931. [Google Scholar] [CrossRef]
Cox, D. Regression models and life-tables. J. R. Stat. Soc. Ser. (Methodol.) 1972, 34, 187–220. [Google Scholar] [CrossRef]
Tutz, G.; Schmid, M. Modeling Discrete Time-to-Event Data; Springer: New York, NY, USA, 2016; Volume 3. [Google Scholar] [CrossRef]
Suresh, K.; Severn, C.; Ghosh, D. Survival prediction models: An introduction to discrete-time modeling. BMC Med. Res. Methodol. 2022, 22, 207. [Google Scholar] [CrossRef]
Kvamme, H.; Borgan, Ø. Continuous and discrete-time survival prediction with neural networks. Lifetime Data Anal. 2021, 27, 710–736. [Google Scholar] [CrossRef]
Zhong, C.; Tibshirani, R. Survival analysis as a classification problem. arXiv 2019, arXiv:1909.11171v2. [Google Scholar] [CrossRef]
Nadaraya, E. On estimating regression. Theory Probab. Its Appl. 1964, 9, 141–142. [Google Scholar] [CrossRef]
Watson, G. Smooth regression analysis. Sankhya Indian J. Stat. Ser. 1964, 26, 359–372. [Google Scholar]
Luong, T.; Pham, H.; Manning, C. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Lisbon, Portugal, 2015; pp. 1412–1421. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
Kvamme, H.; Borgan, O.; Scheel, I. Time-to-Event Prediction with Neural Networks and Cox Regression. J. Mach. Learn. Res. 2019, 20, 1–30. [Google Scholar]
Coolen, F. An imprecise Dirichlet model for Bayesian analysis of failure data including right-censored observations. Reliab. Eng. Syst. Saf. 1997, 56, 61–68. [Google Scholar] [CrossRef]
Coolen, F.; Yan, K. Nonparametric predictive inference withright-censored data. J. Stat. Plan. Andin. 2004, 126, 25–54. [Google Scholar] [CrossRef]
Mangili, F.; Benavoli, A.; de Campos, C.; Zaffalon, M. Reliable survival analysis based on the Dirichlet process. Biom. J. 2015, 57, 1002–1019. [Google Scholar] [CrossRef]
Tang, Z.; Liu, L.; Chen, Z.; Ma, G.; Dong, J.; Sun, X.; Zhang, X.; Li, C.; Zheng, Q.; Yang, L.; et al. Explainable survival analysis with uncertainty using convolution-involved vision transformer. Comput. Med. Imaging Graph. 2023, 110, 102302. [Google Scholar] [CrossRef]
Wang, Y.; Kong, X.; Bi, X.; Cui, L.; Yu, H.; Wu, H. ResDeepSurv: A Survival Model for Deep Neural Networks Based on Residual Blocks and Self-attention Mechanism. Interdiscip. Sci. Comput. Life Sci. 2024, 16, 405–417. [Google Scholar] [CrossRef]
Salerno, S.; Li, Y. High-dimensional survival analysis: Methods and applications. Annu. Rev. Stat. Its Appl. 2023, 10, 25–49. [Google Scholar] [CrossRef]
Bender, A.; Rügamer, D.; Scheipl, F.; Bischl, B. A general machine learning framework for survival analysis. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Ghent, Belgium, 14–18 September 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 158–173. [Google Scholar] [CrossRef]
Emmert-Streib, F.; Dehmer, M. Introduction to Survival Analysis in Practice. Mach. Learn. Knowl. Extr. 2019, 1, 1013–1038. [Google Scholar] [CrossRef]
Chen, G.H. Deep kernel survival analysis and subject-specific survival time prediction intervals. In Proceedings of the Machine Learning for Healthcare Conference, PMLR, Durham, NC, USA, 7–8 August 2020; pp. 537–565. [Google Scholar]
Yang, X.; Qiu, H. Deep Gated Neural Network With Self-Attention Mechanism for Survival Analysis. IEEE J. Biomed. Health Inform. 2024, 29, 2945–2956. [Google Scholar] [CrossRef] [PubMed]
Wang, Z.; Sun, J. Survtrace: Transformers for survival analysis with competing events. In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, Chicago, IL, USA, 7–10 August 2022; pp. 1–9. [Google Scholar] [CrossRef]
Jiang, S.; Suriawinata, A.A.; Hassanpour, S. MHAttnSurv: Multi-head attention for survival prediction using whole-slide pathology images. Comput. Biol. Med. 2023, 158, 106883. [Google Scholar] [CrossRef] [PubMed]
Teng, J.; Yang, L.; Wang, S.; Yu, J. A Semi-Supervised Transformer Survival Prediction Model for Lung Cancer. Adv. Funct. Mater. 2025, 35, 2419005. [Google Scholar] [CrossRef]
Wang, Z.; Gao, Q.; Yi, X.; Zhang, X.; Zhang, Y.; Zhang, D.; Liò, P.; Bain, C.; Bassed, R.; Li, S.; et al. Surformer: An interpretable pattern-perceptive survival transformer for cancer survival prediction from histopathology whole slide images. Comput. Methods Programs Biomed. 2023, 241, 107733. [Google Scholar] [CrossRef] [PubMed]
Yao, Z.; Chen, T.; Meng, L.; Wong, K.C. A Multi-head Attention Transformer Framework for Oesophageal Cancer Survival Prediction. In Proceedings of the 2024 4th International Conference on Artificial Intelligence, Robotics, and Communication (ICAIRC), Xiamen, China, 27–29 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 309–313. [Google Scholar] [CrossRef]
Hill, B.M. De Finetti’s theorem, induction, and A (n) or Bayesian nonparametric predictive inference (with discussion). Bayesian Stat. 1988, 3, 211–241. [Google Scholar]
Walley, P. Inferences from multinomial data: Learning about a bag of marbles. J. R. Stat. Soc. Ser. 1996, 58, 3–57. [Google Scholar] [CrossRef]
Harrell, F.; Califf, R.; Pryor, D.; Lee, K.; Rosati, R. Evaluating the yield of medical tests. J. Am. Med. Assoc. 1982, 247, 2543–2546. [Google Scholar] [CrossRef]
May, M.; Royston, P.; Egger, M.; Justice, A.; Sterne, J. Development and validation of a prognostic model for survival time data: Application to prognosis of HIV positive patients treated with antiretroviral therapy. Stat. Med. 2004, 23, 2375–2398. [Google Scholar] [CrossRef]
Uno, H.; Cai, T.; Pencina, M.; D’Agostino, R.; Wei, L.J. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat. Med. 2011, 30, 1105–1117. [Google Scholar] [CrossRef]
Brier, G. Verification of forecasts expressed in terms of probability. Mon. Weather. Rev. 1950, 78, 1–3. [Google Scholar] [CrossRef]
Graf, E.; Schmoor, C.; Sauerbrei, W.; Schumacher, M. Assessment and comparison of prognostic classification schemes for survival data. Stat. Med. 1999, 18, 2529–2545. [Google Scholar] [CrossRef]
Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
Rubinstein, R.; Kroese, D. Simulation and the Monte Carlo Method, 2nd ed.; Wiley: Hoboken, NJ, USA, 2008; p. 345. [Google Scholar] [CrossRef]
Smith, N.; Tromble, R. Sampling uniformly from the unit simplex. In Technical Report 29; Johns Hopkins University: Baltimore, MD, USA, 2004. [Google Scholar]
Gyorfi, L.; Kohler, M.; Krzyzak, A.; Walk, H. A Distribution-Free Theory of Nonparametric Regression; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar] [CrossRef]
Lei, Y.; Dogan, U.; Zhou, D.X.; Kloft, M. Data-dependent generalization bounds for multi-class classification. IEEE Trans. Inf. Theory 2019, 65, 2995–3021. [Google Scholar] [CrossRef]
Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar] [CrossRef]
Demsar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
Lee, C.; Zame, W.; Yoon, J.; van der Schaar, M. Deephit: A deep learning approach to survival analysis with competing risks. In Proceedings of the 32nd Association for the Advancement of Artificial Intelligence (AAAI) Conference, New Orleans, LA, USA, 2–7 February 2018; pp. 1–8. [Google Scholar] [CrossRef]
Katzman, J.; Shaham, U.; Cloninger, A.; Bates, J.; Jiang, T.; Kluger, Y. DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med. Res. Methodol. 2018, 18, 1–12. [Google Scholar] [CrossRef]
Poché, A.; Hervier, L.; Bakkay, M.C. Natural example-based explainability: A survey. In Proceedings of the World Conference on Explainable Artificial Intelligence, Lisbon, Portugal, 26–28 July 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 24–47. [Google Scholar] [CrossRef]
Destercke, S.; Antoine, V. Combining Imprecise Probability Masses with Maximal Coherent Subsets: Application to Ensemble Classification. In Advances in Intelligent Systems and Computing, Proceedings of the Synergies of Soft Computing and Statistics for Intelligent Data Analysis; Springer: Berlin/Heidelberg, Germany, 2013; Volume 190, pp. 27–35. [Google Scholar] [CrossRef]

Figure 1. Dependence of the C-index and IBS on the number of features for the Linear, Quadratic, Strong Feature Interactions datasets.

Figure 2. Dependence of the C-index and IBS on the number of features for the Sparse Features, Nonlinear, and Noisy datasets.

Figure 3. Dependence of the C-index and IBS on parameter k for the Friedman1, Friedman2, and Friedman3 datasets.

Figure 4. Dependence of the C-index and IBS on the proportion of censored data for the Strong Feature Interactions and Sparse Features datasets.

Figure 5. Comparison of iSurvJ(G) and the Beran estimator for different values of the censored observation proportion for the Linear, Quadratic, and Strong Feature Interactions datasets.

Figure 6. Comparison of iSurvJ(G) and the Beran estimator for different values of the censored observation proportion for the Friedman1, Friedman2, and Friedman3 datasets.

Figure 7. Comparison of iSurvJ(G) and the Beran estimator for different values of the censored observation proportion for the Sparse Features, Nonlinear, and Noisy datasets.

Figure 8. Interval-valued SFs for the Friedman1 dataset for different values of the censoring rate.

Figure 9. Interval-valued SFs for the Nonlinear dataset for different values of the censoring rate.

Figure 10. Unconditional SFs for the Veterans and GBSG2 datasets.

Figure 11. Unconditional SFs for the WHAS500 and Breast Cancer datasets.

Figure 12. Comparison of predicted expected times obtained by the Beran estimator and iSurvJ(G).

Table 1. C-indices obtained for real datasets using the Beran estimator, iSurvM, iSurvQ, iSurvJ, and iSurvJ(G). The best results for each dataset are highlighted in bold, and the second-best results are underlined.

Dataset	Beran	iSurvM	iSurvQ	iSurvJ	iSurvJ(G)
Veterans	0.7040	0.7103	0.7295	0.7196	0.7003
AIDS	0.7529	0.7319	0.7359	0.7139	0.7483
Breast Cancer	0.6519	0.6344	0.6240	0.6487	0.6600
WHAS500	0.7468	0.7632	0.7596	0.7616	0.7575
GBSG2	0.6730	0.6827	0.6883	0.6863	0.6706
BLCD	0.5009	0.5068	0.4835	0.5067	0.5080
LND	0.4695	0.5264	0.5663	0.5730	0.4229
GCD	0.4535	0.5201	0.5717	0.5690	0.4266
CML	0.6410	0.6633	0.6806	0.6917	0.6424
Rossi	0.5817	0.6088	0.6068	0.5853	0.6186
METABRIC	0.6261	0.6410	0.6418	0.6470	0.6407

Table 2. Brier score obtained for real datasets using the Beran estimator, iSurvM, iSurvQ, iSurvJ, and iSurvJ(G). The best results for each dataset are highlighted in bold, and the second-best results are underlined.

Dataset	Beran	iSurvM	iSurvQ	iSurvJ	iSurvJ(G)
Veterans	0.1373	0.1189	0.1221	0.1229	0.1365
AIDS	0.0716	0.0791	0.0805	0.0706	0.0678
Breast Cancer	0.1853	0.2539	0.2377	0.2425	0.2071
WHAS500	0.2174	0.1896	0.1900	0.1883	0.2186
GBSG2	0.2074	0.2019	0.1995	0.1998	0.2186
BLCD	0.2952	0.2701	0.2783	0.2800	0.2716
LND	0.2423	0.2332	0.2194	0.2133	0.2180
GCD	0.1996	0.1911	0.1860	0.1850	0.1991
CML	0.1456	0.1325	0.1372	0.1339	0.1488
Rossi	0.3086	0.1068	0.1083	0.1090	0.1084
METABRIC	0.2015	0.2002	0.1980	0.1952	0.2078

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Konstantinov, A.; Utkin, L.; Efremenko, V.; Muliukha, V.; Lukashin, A.; Verbova, N. Survival Analysis as Imprecise Classification with Trainable Kernels. Mathematics 2025, 13, 3040. https://doi.org/10.3390/math13183040

AMA Style

Konstantinov A, Utkin L, Efremenko V, Muliukha V, Lukashin A, Verbova N. Survival Analysis as Imprecise Classification with Trainable Kernels. Mathematics. 2025; 13(18):3040. https://doi.org/10.3390/math13183040

Chicago/Turabian Style

Konstantinov, Andrei, Lev Utkin, Vlada Efremenko, Vladimir Muliukha, Alexey Lukashin, and Natalya Verbova. 2025. "Survival Analysis as Imprecise Classification with Trainable Kernels" Mathematics 13, no. 18: 3040. https://doi.org/10.3390/math13183040

APA Style

Konstantinov, A., Utkin, L., Efremenko, V., Muliukha, V., Lukashin, A., & Verbova, N. (2025). Survival Analysis as Imprecise Classification with Trainable Kernels. Mathematics, 13(18), 3040. https://doi.org/10.3390/math13183040

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Survival Analysis as Imprecise Classification with Trainable Kernels

Abstract

1. Introduction

2. Related Work

3. Preliminaries

3.1. Survival Analysis

3.2. Attention Mechanism and the Nadaraya–Watson Regression

4. Survival Analysis as an Imprecise Multi-Label Classification Problem

5. Training Strategies and Three Survival Models

5.1. A General Form of Attention Weights

5.2. First Model

5.3. Second Model

5.4. Third Model

5.5. Extended Intervals

6. Numerical Experiments

6.1. Description of Synthetic Data

6.2. Description of Real Data

6.3. Study of the Model Properties

6.3.1. Dependence on the Number of Features

6.3.2. Dependence on Parameter k

6.3.3. Dependence on the Number of Censored Observations

6.3.4. Intervals for Survival Functions

6.3.5. Unconditional Survival Functions

6.3.6. An Illustrative Comparative Example of iSurvJ(G) and Beran Estimator

6.4. Some Conclusions from Numerical Experiments

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI