1. Introduction
The occurrence of extreme events is often accompanied by significant losses, such as those caused by floods, hurricanes, and financial crises. These rare but destructive events can result in damages that are difficult to quantify precisely. Therefore, in various disciplines—including actuarial science, economics, finance, geology, ecology, meteorology, and life sciences—intensive research on extreme events is particularly crucial. A profound understanding of the mechanisms behind extreme events is essential for the effective prevention and mitigation of large-scale disasters.
The tail index (TI), a crucial metric for assessing the probability of extreme events, governs the heaviness of the tail of a distribution. A lower tail index indicates a higher probability of extreme events, underscoring the importance of accurately estimating the tail index in extreme value theory (EVT). The classical literature [1,2,3] extensively analyzes the theoretical properties of traditional tail index estimators and their empirical applications.
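As a concrete point of reference, the best-known of these traditional estimators, the Hill estimator, can be sketched in a few lines. The code below is an illustrative sketch (not taken from the cited works): it estimates the extreme value index gamma = 1/TI from the k largest order statistics of a Pareto-type sample.

```python
import numpy as np

def hill_estimator(y, k):
    """Hill estimator of the extreme value index gamma = 1/alpha,
    computed from the k largest observations:
    (1/k) * sum_{i=1}^{k} log(Y_(n-i+1) / Y_(n-k))."""
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    # log-excesses of the top k order statistics over the (n-k)-th one
    log_excesses = np.log(y[n - k:]) - np.log(y[n - k - 1])
    return log_excesses.mean()

# Illustration: exact Pareto sample with tail index alpha = 2 (gamma = 0.5)
rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
y = u ** (-1.0 / 2.0)            # inverse transform: P(Y > y) = y^{-2}
gamma_hat = hill_estimator(y, k=2_000)
```

The choice of k (the number of upper order statistics) drives the usual bias-variance trade-off in tail estimation.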
In recent years, with the expansion of practical applications, scholars have increasingly recognized the importance of estimating tail indices in the presence of covariate information. When covariates are available, allowing the tail index of the conditional distribution of the response variable to depend on them provides a more realistic, albeit more complex, setting for conditional tail index estimation. This line of work focuses on statistical inference for conditional tail indices when the covariates are random variables; typical studies include [4,5,6].
Despite significant progress in statistical inference for extreme value indices, previous research has primarily focused on inference for conditional tail indices, without deeper consideration of the relationship between covariates and the response variable. As a result, covariates explain the underlying mechanisms of extreme events relatively poorly, limiting the interpretability and applicability of such models in practice. In real-world applications, however, a thorough understanding of the causes of extreme events is crucial for their prevention. To address this limitation, it is natural to introduce covariates related to extreme events and to assume that the tail index depends on them. In [7], tail index regression parameters were estimated under a linear relationship between the tail index and covariates. Subsequent research [8] established the asymptotic properties of the resulting estimators, and Ref. [9] constructed confidence intervals for the regression coefficients using the empirical likelihood method. Ref. [10] studied mixed-frequency covariates and applied them to financial tail risk measurement. In practice, however, the relationship between the tail index and covariates can be highly complex, and a linear assumption may be too simplistic to capture the true impact of the covariates [11]. To enhance model flexibility, Ref. [11] proposed a partially linear semiparametric tail index regression model based on [8], assuming a partially linear semiparametric structure between the tail index and covariates, and established large-sample theory for the resulting estimators. Furthermore, Ref. [12] introduced a varying coefficient model for the tail index, assuming that the coefficient is an unknown function of a single variable, and studied its statistical properties; Ref. [13] subsequently studied the associated hypothesis testing problems. However, owing to the sparsity of extreme data, nonparametric estimation suffers from the curse of dimensionality as the covariate dimension increases: it cannot guarantee efficiency for high-dimensional covariates, making it difficult to describe the entire distribution. Overcoming this challenge therefore calls for a more flexible modeling approach that combines dimensionality reduction, variable selection, and generalized tail event modeling techniques.
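To make the linear specification of [7,8] concrete, the following sketch fits a tail index regression of the form alpha(x) = exp(x'theta) by maximizing an approximate Pareto likelihood over threshold exceedances. The function name, the threshold rule, and the optimizer are our own illustrative choices, not the exact procedure of those papers.

```python
import numpy as np
from scipy.optimize import minimize

def fit_tail_index_regression(x, y, threshold):
    """Fit alpha(x) = exp(x @ theta) by maximizing the approximate
    Pareto log-likelihood built from the exceedances y > threshold."""
    exceed = y > threshold
    xe = x[exceed]
    log_excess = np.log(y[exceed] / threshold)

    def neg_loglik(theta):
        eta = xe @ theta                       # log tail index per observation
        # Pareto log-density of Y/threshold, up to terms constant in theta
        return -(eta - np.exp(eta) * log_excess).sum()

    res = minimize(neg_loglik, np.zeros(x.shape[1]), method="BFGS")
    return res.x

# Illustration on synthetic data with true theta = (0.5, 1.0)
rng = np.random.default_rng(1)
n = 20_000
x = np.column_stack([np.ones(n), rng.uniform(size=n)])
alpha = np.exp(x @ np.array([0.5, 1.0]))
y = rng.uniform(size=n) ** (-1.0 / alpha)      # exact Pareto given x
theta_hat = fit_tail_index_regression(x, y, threshold=np.quantile(y, 0.9))
```

Because the conditional excess Y/threshold is exactly Pareto here, the exceedance likelihood is correctly specified and the estimates recover theta.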
Inspired by the discussions above, this study aims to integrate more flexible varying coefficient models (VCMs) with extreme value analysis to derive effective estimates of tail indices with interaction effects among covariates. A key feature of VCMs is that they allow the coefficients of covariates to vary smoothly with other variables, enabling the assessment of nonlinear interactions. Notably, the varying index coefficient model (VICM) proposed by [14], which encompasses a range of commonly used semiparametric models, offers sufficient flexibility for diverse applications.
The VICM can model and assess nonlinear interaction effects of grouped covariates on the response variable, addressing situations in which individual covariate effects are weak but their combined effect is strong. It effectively mitigates the curse of dimensionality commonly encountered in high-dimensional nonparametric estimation while combining the advantages of the single-index model and the varying coefficient model, as highlighted by [15]. Moreover, it is easily interpretable in practical applications. As a highly useful semiparametric model, the VICM includes many other important statistical models as special cases, for instance, the partially linear single-index model [16], the additive model [17], the partially linear additive model [18], and the varying coefficient model [19], among others. Owing to these numerous advantages, the VICM has been studied extensively. For example, Ref. [20] extended it to time series data and developed the varying-index coefficient autoregressive model, while Ref. [21] extended it to quantile regression and investigated its statistical properties in high-dimensional settings.
Despite significant advances in the estimation and application of varying coefficient models in general settings, a notable gap remains in the context of extreme value analysis. Therefore, building upon the natural extensions proposed by [11,12], we extend the VICM to extreme value analysis, introducing a novel tail index regression model based on the VICM. This approach harnesses the power of the VICM to address the challenges posed by complex covariates in extreme value analysis.
Additionally, variable selection is incorporated into our model. In regression analysis, omitting key predictors can introduce significant bias, while including irrelevant predictors can diminish estimation efficiency; variable selection is therefore an indispensable aspect of modern statistical inference. Traditional approaches include strategies based on hypothesis testing and on information criteria such as the AIC and BIC [22]. Moreover, penalty-based techniques such as those of [23,24], ridge regression [25], the LASSO [26], the adaptive LASSO [27], and the smoothly clipped absolute deviation (SCAD) penalty [28] provide effective solutions. Variable selection in varying coefficient models (VCMs) has also been studied extensively; notable works include [29,30,31]. In particular, Ref. [30], building on the SCAD method, proposed an innovative composite penalty technique that accurately identifies the true structure of single-index varying coefficient models (SIVCMs), effectively selects key variables, and precisely estimates the unknown index parameters and coefficient functions. Given its desirable properties, such as unbiasedness, sparsity, continuity, and the oracle property, we integrate this variable selection method into our model. This integration not only enhances the predictive accuracy of the model but also ensures the efficiency and effectiveness of variable selection.
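For reference, the SCAD penalty of [28], on which the composite penalty of [30] builds, can be written down directly. The implementation below is a straightforward sketch of the penalty function itself, with the conventional choice a = 3.7.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), applied elementwise.

    Piecewise form: lam*|b| for |b| <= lam; a concave quadratic blend
    for lam < |b| <= a*lam; constant (a+1)*lam^2/2 for |b| > a*lam,
    so large coefficients are not shrunk (near-unbiasedness)."""
    b = np.abs(np.asarray(beta, dtype=float))
    linear = lam * b
    middle = (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))
    flat = (a + 1) * lam**2 / 2
    return np.where(b <= lam, linear, np.where(b <= a * lam, middle, flat))
```

The flat region beyond a*lam is what distinguishes SCAD from the LASSO, whose penalty keeps growing and therefore biases large coefficients.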
This study combines varying index coefficient models with extreme value theory to construct a flexible framework for analyzing extreme events, with the aim of investigating how potential factors with nonlinear interactions affect the probability of extreme events. During model construction, we incorporate a variable selection mechanism to ensure the consistency and predictive power of the model’s estimates. The broad applicability of varying index coefficient models allows this framework to encompass mainstream regression-based extreme value analysis models, including the parametric approach of [8] and the semiparametric approach of [11], while addressing their limitations in certain scenarios. However, constructing a comprehensive theoretical framework for this model is challenging: it requires an in-depth analysis of both parametric and nonparametric estimators and the integration of more flexible tail index models, and the complex interactions between parameters and the intricacies of the technical details make the study of the asymptotic theory exceptionally demanding. Ref. [32] revealed the limitations of one-step spline approximation for deriving asymptotic distributions. We therefore adopt the two-step estimation strategy proposed by [33,34,35]: first, we approximate the nonparametric function with B-splines to obtain preliminary estimates of the parameters and the nonparametric function; then, we update the nonparametric single-index function using the B-spline back-fitted kernel smoothing (BSBK) method, thereby establishing the asymptotic properties of the nonparametric function. To validate the model’s finite-sample performance and the effectiveness of the variable selection, we conduct Monte Carlo simulations. Additionally, to demonstrate the practical application of the model, we provide a real data case study analyzing risk factors with Chinese stock market index data.
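The first step of this strategy rests on a B-spline basis expansion of the nonparametric function. The following self-contained sketch evaluates such a basis via the Cox-de Boor recursion; the knot placement shown is an illustrative choice, not the one used in our estimation procedure.

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """Evaluate all B-spline basis functions at the points x using the
    Cox-de Boor recursion; `knots` is non-decreasing with boundary knots
    repeated degree+1 times. Returns a (len(x), n_basis) design matrix."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    # degree 0: indicator of each knot interval [t_i, t_{i+1})
    B = np.array([(t[i] <= x) & (x < t[i + 1]) for i in range(len(t) - 1)],
                 dtype=float).T
    last = np.nonzero(t[:-1] < t[1:])[0].max()   # last nonempty interval
    B[x == t[-1], last] = 1.0                    # include the right endpoint
    for d in range(1, degree + 1):               # raise the degree step by step
        Bn = np.zeros((len(x), len(t) - d - 1))
        for i in range(len(t) - d - 1):
            left, right = t[i + d] - t[i], t[i + d + 1] - t[i + 1]
            if left > 0:
                Bn[:, i] += (x - t[i]) / left * B[:, i]
            if right > 0:
                Bn[:, i] += (t[i + d + 1] - x) / right * B[:, i + 1]
        B = Bn
    return B

# Clamped cubic basis on [0, 1] with three interior knots
knots = np.concatenate(([0.0] * 4, [0.25, 0.5, 0.75], [1.0] * 4))
grid = np.linspace(0.0, 1.0, 41)
design = bspline_basis(grid, knots)
```

Regressing the response on such a design matrix gives the preliminary spline estimate; the basis functions are nonnegative and sum to one at every point of the domain.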
This study makes significant contributions to the field in three main areas. First, we introduce a novel varying index coefficient tail index regression model (VICM-TIR) and systematically investigate its asymptotic properties, including the consistency of variable selection and the oracle property of the estimators, thereby providing a robust theoretical foundation for complex data analysis. To the best of our knowledge, previous models have not achieved this. Second, the VICM-TIR model adeptly handles nonlinear interaction effects among covariates and effectively overcomes the curse of dimensionality, demonstrating exceptional flexibility and interpretability; it encompasses the mainstream models currently documented in the literature. Finally, in practical applications, the VICM-TIR model exhibits remarkable modeling flexibility and applicability. It performs well even with small sample sizes and in high-dimensional scenarios, offering a powerful tool for complex data analysis; notably, it shows significant innovative value in the analysis of extreme events.
The subsequent sections of this paper are structured as follows: Section 2 introduces the varying index coefficient model for tail index regression, detailing the estimation procedure and the method for selecting tuning parameters. Section 3 establishes the asymptotic theory and the properties of the estimators. Section 4 presents the findings of a simulation study evaluating the model’s performance across various scenarios. Section 5 illustrates the practicality and effectiveness of the model through an analysis of real-world data. Finally, Section 6 concludes the paper, and Appendix A contains the proofs of the theorems.
3. Asymptotic Theory
In this section, we investigate the asymptotic properties of the proposed estimators. The detailed proofs, as well as the underlying assumptions, are relegated to Appendix A.
To establish notation, we designate as the true value of throughout this text. For conciseness, we introduce as a concatenation of , where denotes the j-th component, . Without loss of generality, we assume that for and for . Furthermore, denotes the nonzero varying coefficients for , nonzero constants for , and zero values for .
Carrying on with the previous notation, we define as the Jacobian matrix of size representing the partial derivative of with respect to . To simplify the notation, let . Furthermore, we define the space as the set of functions with a finite norm on the domain , where and .
To investigate the large-sample characteristics of the parameter estimators, we introduce as the vector of true parameters, where and for . For each , let be the function that satisfies the given condition: Let , and the gradient matrix of the log-TI is For any matrix , denote . Then define
Theorem 1. Suppose that Assumptions A1–A11 in Appendix A hold and the number of knots . Then, we have (i) ; (ii) , where
Theorem 2. Suppose that Assumptions A1–A11 in Appendix A hold and the number of knots . If , then, with probability approaching one, and must satisfy: (i) ; (ii) are nonzero constants for and .
Let , , and
Theorem 3. Under Assumptions A1–A11 in Appendix A and the conditions of Theorem 2, we have where and its dimension is ; is defined in (22), and is given in (26).
Theorems 1 and 2 establish the consistency of the variable selection process. Furthermore, Theorems 1, 2, and 3 collectively demonstrate the oracle property of
. Specifically, these theorems indicate that our proposed estimators achieve the optimal convergence rate and share the asymptotic distribution of estimators based on the correct submodel. Notably, Theorem 1 shows that the spline estimator obtained from the estimation procedure in (18) is a consistent estimator of . However, the asymptotic distribution of is not available. To address this issue, we employ a two-step spline backfitted local linear (SBLL) estimation method to further refine the nonparametric function . Without loss of generality, we focus on the estimation of the first nonparametric function , as the other functions can be estimated similarly. We use the spline estimates for as initial estimates and define . Then, let , and for each given , is estimated through local linear fitting as , where and is the bandwidth. We then derive the estimator by minimizing the following local kernel objective function: where is a non-negative symmetric kernel function, and . Since for are unknown, we adapt (28) by substituting the spline estimators from (18) for . This substitution is equivalent to replacing in (28) with . The resulting modified SBLL estimator is denoted by . Denote and for , and assume that the following expressions converge in probability as the sample size , that is, where is the marginal probability density function of .
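The second-step update relies on standard local linear kernel fitting. The sketch below illustrates the generic building block on a simple regression problem; in the actual SBLL procedure, the response and weights are the pilot-spline-adjusted quantities described above, so this is an illustration of the technique, not our exact estimator.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, a common non-negative symmetric choice."""
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def local_linear(u0, u, v, h):
    """Local linear estimate of E[v | u = u0] with bandwidth h:
    weighted least squares of v on (1, u - u0) with kernel weights."""
    w = epanechnikov((u - u0) / h)
    X = np.column_stack([np.ones_like(u), u - u0])
    WX = w[:, None] * X
    # small ridge term guards against empty windows
    beta = np.linalg.solve(X.T @ WX + 1e-10 * np.eye(2), X.T @ (w * v))
    return beta[0]            # the intercept is the fitted value at u0

# Illustration: recover a smooth function from noisy observations
rng = np.random.default_rng(2)
u = rng.uniform(-1.0, 1.0, size=5_000)
v = np.sin(np.pi * u) + 0.1 * rng.normal(size=5_000)
fit = local_linear(0.5, u, v, h=0.1)
```

Unlike a local constant (Nadaraya-Watson) fit, the local linear fit removes the leading boundary bias, which is why it is the standard second-step smoother in backfitted procedures.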
Theorem 4. Suppose that Assumptions A1–A11 in Appendix A are satisfied. For , as , for any , we have where with , and defined in (29), (30), and (31), respectively.
Next, we establish the uniform oracle efficiency of the SBLL estimator . Specifically, Theorem 5 demonstrates that the absolute difference between and is uniformly bounded by . Consequently, and share the same asymptotic distribution.
Theorem 5. Under Assumptions A1–A11 in Appendix A, and , we have
Corollary 6. Under Assumptions A1–A11 in Appendix A, and , as , we have
4. Monte Carlo Studies
In this section, we evaluate the finite-sample performance of the proposed estimator through Monte Carlo simulations. Adhering to the setup outlined in [8,11], we postulate that the response variable follows a specific distribution, detailed as follows: Afterwards, let and , . By adjusting the parameter c of the slowly varying function, a diverse set of distributions for y can be generated, facilitating simulations across a range of scenarios. Specifically, the values of c were chosen as to illustrate different scenarios, while the sample size n was varied as to examine performance across data sizes. For the parametric components, the marginal distributions of and were specified as and . Additionally, the true single-index function was represented by a smooth function, defined as:
Adopting the methodology of [18], we utilize equidistant knots with the constant K set to 3. To determine the sample fraction, as suggested by [8], we analyze a set of 100 distinct values (denoted as ) along with their corresponding sample fractions ( ) distributed uniformly within the range . For each model configuration, we conduct 5000 simulation runs. The precision of estimating and is quantified using mean squared errors, where signifies the regular grid points for evaluating the function , represents the true value of as defined in (2), and denotes its estimator. For our simulation, is adopted. The results are summarized in Table 1, where the columns and report the mean numbers of correctly identified nonzero coefficients, while and report the mean numbers of incorrectly identified zero coefficients. The row “VICMTIR-VS” reports the performance of our proposed estimator with the variable selection procedure, and the row “Oracle” reports the estimator’s performance under the true model with known zero coefficients.
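For readers who wish to reproduce a simplified version of this design, Pareto-type responses with a covariate-dependent tail index can be drawn by inverse-transform sampling. The sketch below uses the exact Pareto case and a hypothetical log-linear tail index; it is not the precise slowly varying function or index structure of our simulation.

```python
import numpy as np

def sample_conditional_pareto(alpha, rng):
    """Inverse-transform sampling: given per-observation tail indices
    alpha_i, return Y_i with P(Y_i > y) = y^{-alpha_i} for y >= 1."""
    u = rng.uniform(size=len(alpha))
    return u ** (-1.0 / alpha)

rng = np.random.default_rng(3)
z = rng.uniform(size=50_000)
alpha = np.exp(0.5 + z)          # hypothetical covariate-dependent tail index
y = sample_conditional_pareto(alpha, rng)
```

Adding a slowly varying factor to the survival function (the role of the parameter c above) perturbs this exact Pareto tail and lets one study how threshold choice interacts with model bias.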
With an increasing sample size, there is a concurrent decline in mean squared errors (MSEs) and standard deviations (STDs) across diverse parameter settings for c in the slowly varying function. Simultaneously, the accuracy of variable selection is enhanced, highlighting the robustness of the model estimators. Furthermore, the third column of the table exhibits the median sample fraction derived from 5000 realizations, demonstrating a declining pattern as the sample size expands, which aligns with our expectations. Additionally, as the sample size enlarges, the variable selection method progressively approaches the performance of the oracle procedure concerning model error.
Graphically, Figure 1 depicts the bias of the nonzero parameters, while Figure 2 exhibits the bias of the zero parameters. Both figures reveal that the mean bias of the estimators is close to 0, demonstrating the effectiveness of the proposed estimation approach. Furthermore, Figure 3 shows the fitted curves of alongside the corresponding 95% pointwise confidence intervals, indicating a satisfactory fit for the nonlinear function.
Subsequently, we compare our model with alternative parametric and nonparametric models that incorporate tail indices in diverse forms. The estimation performance of these models is analyzed for both low- and high-dimensional covariates, with dimensions set to and . The parameter vector comprises , where is defined as earlier and is a d-dimensional vector. Without loss of generality, the parameter c in the slowly varying function is set to for the simulations. For the linear setting, we adopt the method from [8], using . For the single-index model (SIM), we consider the approach from [38], applying . In the fully nonparametric setting (NPM), we follow the methodology described in [39], employing kernel smoothing with the Epanechnikov kernel, assuming equal bandwidths, and utilizing . Finally, for the general varying coefficient model (VCM), we adhere to the approach from [12], implementing .
Table 2 presents the outcomes of 500 Monte Carlo simulations. Comparing the average squared errors (ASEs) across models, we observe that for the single-index model (SIM) and our estimator with variable selection (VICMTIR-VS), the ASEs do not rise substantially as the dimensionality of the covariates grows under different sample sizes. Conversely, for the fully nonparametric model and the general varying coefficient model, an appreciable increase in the ASEs is observed as the covariate dimension increases, highlighting the curse of dimensionality in high-dimensional nonparametric estimation. The oversimplified linear model, unable to capture the nonlinear effects of the factors, yields significant estimation errors. The VICMTIR-VS approach, by contrast, maintains a commendable level of model flexibility and demonstrates satisfactory estimation accuracy even with small sample sizes and large parameter dimensions.
5. Empirical Analysis
In the assessment of extreme financial events and market risks, extreme value theory (EVT) is a robust and widely acknowledged methodology for quantifying high-quantile random phenomena. Here, we deploy our model to gauge tail risk in financial markets. Specifically, we use daily trading data on the CSI 300 Index in China, spanning the period from 8 April 2010 to 1 February 2023 and comprising 2657 observations. To validate the model’s effectiveness, we allocate the initial 80% of the dataset to in-sample parameter estimation, reserving the remaining 20% for out-of-sample validation.
In selecting covariates, we acknowledge the direct influence of the index’s financial indicators on tail risk. Furthermore, given economic globalization, tail risk is also influenced by fluctuations in international markets. Consequently, we postulate that major global market indices affect the tail risk of the CSI 300 Index by influencing its underlying financial indicators. The precise definitions and settings of the variables are outlined in Table 3. The returns for each index are calculated employing the formula . The descriptive statistics in Table 4 show that most covariates exhibit skewed distributions. Therefore, following the standardization approach outlined in [8], the variables were transformed using rank transformations. Specifically, let be ’s rank in the sample ; the rank transformation then redefines (the normal score transformation).
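The normal score transformation used here is simple to reproduce. The sketch below is one common implementation, mapping ranks through Phi^{-1}(R_i/(n+1)); the exact rank convention in [8] may differ.

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_score(x):
    """Rank-based normal score transform: replace each value by
    Phi^{-1}(rank / (n + 1)), mapping any sample to approximate
    standard normality while preserving the ordering."""
    r = rankdata(x)                      # average ranks in case of ties
    return norm.ppf(r / (len(x) + 1))    # n + 1 in the denominator avoids +/-inf

# A heavily skewed sample becomes symmetric after the transform
rng = np.random.default_rng(4)
x = rng.lognormal(size=10_000)
z = normal_score(x)
```

Because the transform depends only on ranks, it is insensitive to the marginal skewness and heavy tails seen in Table 4.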
First, the model is fitted on the training dataset, using the threshold selection approach outlined previously to identify the effective sample for estimation, which comprises roughly 20% of the total. The parameter estimates are presented in Table 5.
Figure 4 illustrates the estimated varying coefficient functions, revealing a noteworthy nonlinear association between the internal variables and tail risk. Notably, the influence of the internal variables exhibits a distinct interplay with the international indices. Based on the estimation results, trading volume and turnover rate have primarily negative effects: as these metrics increase, the tail index diminishes, intensifying tail risk. Conversely, trading value and the P/BV ratio have predominantly positive impacts: when these values rise, the tail index increases, mitigating tail risk.
After estimating the parameters, we derive the tail index. To evaluate the goodness-of-fit of the model, we employ the QQ-plot methodology [40], constructing a plot for the pairs where . Here, is defined as and represents the empirical distribution of . Ideally, if is large enough and the model fits well, should follow a uniform distribution on . As depicted in Figure 5, the close alignment between the 45-degree reference line (solid line) and the QQ-plot (dashed line) suggests a robust fit of our VICM-TIR model. The tail indices of the log-return distribution, shown in the right panel of Figure 6, are small, indicating heavy tails and a high likelihood of extreme losses. Notably, the estimation results indicate that tail risk had already emerged before the turbulence observed in the Chinese stock market on 19 June 2015.
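The uniformity check underlying this QQ-plot can be sketched as follows: under a well-fitting Pareto-type tail, the probability integral transform of the exceedances is approximately Uniform(0,1). The helper names and the transform (Y/omega)^{-alpha_hat} are our illustrative rendering of this kind of construction, not the exact form used in [40].

```python
import numpy as np

def pit_uniform(y, threshold, alpha_hat):
    """Probability integral transform of the exceedances: if the fitted
    Pareto tail is correct, (Y/threshold)^{-alpha_hat} is ~ Uniform(0,1)."""
    exceed = y > threshold
    return (y[exceed] / threshold) ** (-alpha_hat[exceed])

def qq_pairs(u):
    """Pairs (theoretical uniform quantile, sorted PIT value) for a QQ-plot."""
    u = np.sort(u)
    k = len(u)
    return np.column_stack([np.arange(1, k + 1) / (k + 1), u])

# Under a correctly specified tail the points hug the 45-degree line
rng = np.random.default_rng(5)
alpha = np.full(20_000, 2.0)                 # constant tail index for the demo
y = rng.uniform(size=20_000) ** (-1.0 / alpha)
pairs = qq_pairs(pit_uniform(y, threshold=np.quantile(y, 0.9), alpha_hat=alpha))
```

Plotting the second column of `pairs` against the first reproduces the diagnostic in Figure 5: systematic departures from the diagonal signal tail misspecification.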