1. Introduction
Accurate pricing models are crucial for trading, hedging, and managing risk in option portfolios. In 1973, Black, Scholes, and Merton introduced their famous option pricing model, which is now known as the Black–Scholes (BS) model. This model assumes that the underlying stock follows a geometric Brownian motion with constant drift and volatility. Despite its simplicity, the Black–Scholes model and its variants remain the most widely used option pricing models in finance today.
One critical parameter in the Black–Scholes option pricing model is market volatility, which can be estimated using implied volatility. To calculate implied volatility, one inputs the market price of the option into the Black–Scholes formula and back-solves for the value of the volatility. Implied volatility is an estimate of the asset’s future variability that underlies the option contract. In the Black–Scholes framework, volatility is assumed to be constant across strikes, time to maturity, and time. However, in reality, there are variations in implied volatility across option strikes and time to maturity. Canina and Figlewski [
1] demonstrated that when plotting implied volatility against moneyness (the ratio between the strike price and underlying spot price) for a given time to maturity, the resulting plot often resembles a smile or skew in shape. This variation in implied volatility across moneyness is known as the implied volatility smile, and the variation across time to maturity is known as the term structure of volatility. An implied volatility surface (IVS; for the remainder of the paper, IVS refers only to the implied volatility surface) combines the implied volatility smile and the term structure of volatility in a single three-dimensional surface for all options on a given underlying asset.
In recent decades, economists and finance experts have sought to exploit the predictability of the IVS. Dumas, Fleming, and Whaley [
2] proposed a linear model of implied volatility over strike price and time to maturity. However, when they applied the model to a weekly cross-section of S&P 500 options prices, they found that coefficient estimates were highly unstable. In 2000, Heston and Nandi [
3] developed a moving window non-linear GARCH(1,1) model for the IVS. While their approach was an improvement over previous methods, they also found that some of the coefficients were unstable. Goncalves and Guidolin [
4] proposed a method that models the implied volatility over a time-adjusted measure of moneyness and time to maturity. This method represents the culmination of several decades of research into the predictability of the IVS and requires a deep understanding of financial practices in this area.
What if we lack thorough knowledge of the area? Can we use data-driven methods instead of human intelligence to identify a reasonable functional form for the IVS? Additionally, is it possible to use data-driven methods to determine useful features that impact the IVS? Traditional machine learning models such as random forests and gradient boosting can provide good predictive performance. However, the fitted models are non-parametric and do not yield an explicit functional form that can be interpreted in financial terms. Moreover, these models can be highly sensitive to the tuning of hyperparameters.
Symbolic regression is a machine learning approach that can discover the underlying mathematical expressions describing a dataset, which could be a viable method for studying the actual relationship between IVS, moneyness, and time to maturity. The advantage of symbolic regression is that it can identify the relationship between input variables and output variables in a given dataset without a predefined functional form. In recent years, this method has been used in areas such as physics and artificial intelligence (AI). The most commonly used approach for symbolic regression is genetic programming (GP), which was proposed and improved by Schmidt and Lipson [
5] and Back et al. [
6]. Udrescu and Tegmark [
7] proposed a physics-inspired method for symbolic regression called AI Feynman, which has proven successful in discovering all 100 equations from the Feynman Lectures on Physics. Petersen et al. [
8] presented the approach called deep symbolic regression (DSR), which is a gradient-based approach for symbolic regression. One commonality among these methods is that they have only been used and tested on data with low noise levels.
In this paper, we explore whether symbolic regression approaches can perform well in discovering mathematical relationships among IVS, moneyness, and time to maturity using financial data with high noise levels. We also adopt Bayesian optimization to tune the hyperparameters for symbolic regression.
This paper is structured as follows.
Section 2 provides a brief introduction to implied volatility surface (IVS). Two symbolic regression approaches, GP and DSR, are introduced in detail in
Section 3. In
Section 4, we thoroughly introduce the parameter tuning approach, Bayesian optimization. Simulation studies are conducted in
Section 5, and a real data analysis is presented in
Section 6. Finally, in
Section 7, we summarize the paper.
2. Implied Volatility Surface
Before diving into the concept of Implied Volatility Surface, it is important to understand the Black–Scholes (BS) model. The BS model is an option pricing model widely used by market participants such as hedge funds to determine the theoretically fair value of an option contract. The model was first proposed by Black, Scholes and Merton in 1973. The Black–Scholes model can only be used to calculate the price of a European option. A European option is a version of an option contract that limits execution to its expiration date. It allows the holder to potentially transact on an underlying asset at a preset price. There are two types of European options: the call option and the put option. A European call option gives the owner the right to acquire the underlying asset at the expiration date for the preset price, while a European put option allows the holder to sell the underlying asset at expiry for the preset price. These pre-specified prices are called strike prices.
The BS model assumes that the price of the underlying asset follows a geometric Brownian motion with constant drift and volatility, that is:
$$dS_t = \mu S_t\,dt + \sigma S_t\,dW_t, \qquad (1)$$
where $S_t$ is the price of the asset at time t, $\mu$ is the drift, $\sigma$ is the volatility, and $W_t$ is the standard Brownian motion. Black and Scholes [9] proposed a partial differential equation (PDE) governing the price evolution of a European option under the Black–Scholes model. The Black–Scholes formula is a solution to the Black–Scholes PDE, which calculates the price of European put or call options. Without loss of generality, let us suppose we have a call option with price C; then
$$C = S_0 N(d_1) - K e^{-rT} N(d_2), \qquad d_1 = \frac{\ln(S_0/K) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T}, \qquad (2)$$
where K is the strike price, which is the fixed price for the asset that can be bought by the option holder; T is time to maturity, which represents the time until the option’s expiration; $S_0$ is the current price of the underlying asset; r is the risk-free interest rate; $N(\cdot)$ is the cumulative standard normal distribution function; and $\sigma$ is the constant volatility. We can solve for the unknown volatility $\sigma$ through Equation (2) using the observed option prices to obtain the implied volatility (IV).
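The back-solving step can be illustrated with a minimal sketch. It assumes the standard Black–Scholes call formula and uses plain bisection, which works because the call price is increasing in volatility. The function names, the search bracket, and the tolerance are illustrative choices, not taken from the paper.

```python
import math


def norm_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_price(s0, k, t, r, sigma):
    """Black-Scholes price of a European call (no dividends)."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s0 * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)


def implied_vol(price, s0, k, t, r, lo=1e-6, hi=5.0, tol=1e-8, max_iter=200):
    """Back-solve for sigma by bisection on the bracket [lo, hi] (assumed bounds)."""
    if not bs_call_price(s0, k, t, r, lo) <= price <= bs_call_price(s0, k, t, r, hi):
        raise ValueError("price is not attainable for sigma in [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        # The call price is increasing in sigma, so move the bound on the same side.
        if bs_call_price(s0, k, t, r, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Pricing an option at a known volatility and passing that price to `implied_vol` should return the same volatility up to the tolerance, which is a convenient sanity check before applying the routine to market quotes.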
If the assumptions in the Black–Scholes model hold, for options written on the same underlying but with different strike price or time to maturity, the implied volatility would be the same. However, this is not observed in practice.
Figure 1 shows the volatility surface for put options from 30 stocks from the S&P 500 Index on 21 January 2022. These data will also be used in our empirical study. It is apparent that lower strikes tend to have higher implied volatility. Additionally, for a given time to maturity
T, the curve of implied volatility against strike exhibits a skew or smile shape. The observed changes in implied volatility with strike and maturity contradict the Black–Scholes assumption. Therefore, the objective of this paper is to estimate this surface as a function of all K and T, that is, to find a functional form of $\sigma_t(K, T)$, where t indicates its time-dependence.
Economists have spent years trying to explain this empirical pattern using financial theory and to build models that predict the IVS. In this paper, we explore the ability of a machine learning (ML) method called symbolic regression to identify the IVS using moneyness (a term that describes the relationship between the strike price of an option and the underlying price of the asset) and time to maturity, without requiring an in-depth understanding of financial theories.
5. Simulation Study
In the previous sections, we introduced two approaches for symbolic regression as well as methods for tuning their parameters. In this section, we evaluate the performance of these approaches through simulation studies to determine whether they can accurately identify the true relationships between inputs and outputs in a given dataset.
We simulated data using the following five “true” relationships:
Simulation 1: ;
Simulation 2: ;
Simulation 3: ;
Simulation 4: ;
Simulation 5: ;
where the error term in each relationship represents the noise we add to our simulations.
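A minimal sketch of how noisy observations from one such relationship could be generated is given below. Everything specific in it is an assumption made for illustration: the stand-in function `example_surface`, the input ranges, uniform sampling of the inputs, and additive Gaussian noise with standard deviation `sigma`.

```python
import numpy as np


def example_surface(k, tau):
    """Hypothetical smile-shaped stand-in, NOT one of the paper's five relationships."""
    return 0.2 + 0.5 * (k - 1.0) ** 2 + 0.05 * tau


def simulate(true_fn, n, sigma, k_range, tau_range, seed=0):
    """Draw n inputs uniformly and add Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    k = rng.uniform(k_range[0], k_range[1], size=n)
    tau = rng.uniform(tau_range[0], tau_range[1], size=n)
    noise = rng.normal(0.0, sigma, size=n)  # sigma = 0 gives noise-free data
    return k, tau, true_fn(k, tau) + noise


# Example: 3000 observations at a noise level of 0.01 (ranges are assumed values).
k, tau, y = simulate(example_surface, n=3000, sigma=0.01,
                     k_range=(0.8, 1.2), tau_range=(0.05, 1.0))
```

Keeping the noise level as an explicit argument makes it straightforward to rerun the same design over a grid of noise levels and sample sizes.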
The objective of Simulation 1 is to replicate the relationship between IVS, moneyness, and time to maturity based on the call and put implied volatilities observed on 2 May 2000, as detailed in
Figure 1 of Fengler et al. [
12]. In this simulation, we set
, which aligns with the observed range. Similarly, the range for time to maturity
is set as
, which is generally within the observed range of time to maturity (days/365). The response range for
is set at
. A 3D surface plot for Simulation 1 is presented in
Figure 13, where we can observe the smile curves when
is fixed, even though it is not as complex as the IVS observed in practice.
The remaining simulations aim to evaluate the effectiveness of symbolic regression in detecting specific mathematical terms of two parameters. Simulation 2 focuses on testing and polynomials nested in , while Simulation 3 tests exponential and polynomial equations. Simulation 4 evaluates the ratio of logarithms and square roots, and Simulation 5 tests a simple polynomial relationship.
We generated 3000 observations for each simulation configuration with varying noise levels. To evaluate each method, we started with a noise level of 0 and increased it incrementally (0, 0.005, 0.01, up to 0.1) until the method failed to detect the true mathematical relationship. We tuned the following parameters for GP-SR:
Genetic Operation Probabilities: “p crossover”, “p subtree mutation”, “p hoist mutation”, “p point mutation”;
Population Size: Number of expression trees we generated in the initial population;
Initial Method: The method we use to generate the initial population;
Parsimony Coefficient: The parameter that controls the complexity levels of the proposed equations.
We used the “gplearn” module in Python to apply GP-SR and the “skopt” module to apply Bayesian optimization to tune the genetic operation probabilities, population size, and initial method for GP-SR. For the parsimony coefficient, we only tested values of 0.005 and 0.01 in this simulation study. For the remaining parameters in GP-SR, we used the default values. The “initial depth” was set to , which specified the range of initial depths for the first generation of expression trees. The “population size” was set to 2000, which controlled the number of expression trees competing in each generation.
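The tuning setup described above can be sketched roughly as follows. This is an illustrative outline rather than the authors' code: the search ranges, the number of generations, the number of optimizer calls, the 0.95 cap, and the use of in-sample RMSE as the objective are all assumptions. The `gplearn` and `skopt` imports are placed inside the tuning function so that the small helper can be used on its own.

```python
def scale_operation_probs(p_crossover, p_subtree, p_hoist, p_point, cap=0.95):
    """Rescale the four probabilities so that they sum to at most `cap`.

    gplearn rejects settings whose genetic-operation probabilities sum to
    more than 1; whatever mass is left over is used for plain reproduction.
    """
    probs = [p_crossover, p_subtree, p_hoist, p_point]
    total = sum(probs)
    if total > cap:
        probs = [p * cap / total for p in probs]
    return tuple(probs)


def tune_gp_sr(X, y, parsimony=0.005, n_calls=20, seed=0):
    """Bayesian optimization of selected GP-SR settings (assumed search space)."""
    import numpy as np
    from gplearn.genetic import SymbolicRegressor
    from skopt import gp_minimize
    from skopt.space import Categorical, Integer, Real

    space = [
        Real(0.05, 0.90, name="p_crossover"),
        Real(0.01, 0.30, name="p_subtree_mutation"),
        Real(0.01, 0.30, name="p_hoist_mutation"),
        Real(0.01, 0.30, name="p_point_mutation"),
        Integer(500, 3000, name="population_size"),
        Categorical(["grow", "full", "half and half"], name="init_method"),
    ]

    def objective(params):
        pc, ps, ph, pp, pop, init = params
        pc, ps, ph, pp = scale_operation_probs(pc, ps, ph, pp)
        model = SymbolicRegressor(
            population_size=int(pop), generations=20, init_method=init,
            p_crossover=pc, p_subtree_mutation=ps,
            p_hoist_mutation=ph, p_point_mutation=pp,
            parsimony_coefficient=parsimony, random_state=seed,
        )
        model.fit(X, y)
        resid = y - model.predict(X)
        return float(np.sqrt(np.mean(resid ** 2)))  # in-sample RMSE (assumed objective)

    result = gp_minimize(objective, space, n_calls=n_calls, random_state=seed)
    return result.x, result.fun
```

Calling `tune_gp_sr(X, y)` with an `(n, 2)` input array and a response vector returns the best setting found and its RMSE. Each objective evaluation fits a full GP-SR model, so the number of calls dominates the run time.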
For deep symbolic regression (DSR), we utilized a Python module called “deep symbolic optimization”, developed by Brenden K Petersen, which is based on Petersen et al. [
8]. We used default hyperparameter values for learning rate and the number of layers for RNN.
For Simulation 1, when we have
, both GP-SR and DSR can successfully detect the true mathematical function, that is
. However, when we increased
to
, both methods failed to detect it. Functions detected by GP-SR and DSR are shown in
Table 1. Both of the methods at least successfully detected the term
and
.
Figure 14 depicts two angles of the 3D plot of IVS,
k and
. The blue surface represents the true simulated IVS, while the dots with different colors represent the estimated implied volatility obtained by the two “wrong” functions proposed by each method. The results obtained by the two approaches are quite similar, with both being moderately far from the true simulated surface. The RMSE values for the estimated volatility obtained by each method and the true simulated implied volatility are displayed in
Table 1.
For Simulation 2, when we have
, both GP-SR and DSR can successfully detect the true mathematical function, that is
. However, when we increased
to
, both methods failed. Functions detected by GP-SR and DSR are shown in
Table 2. Both of the functions detected the term
but failed to detect
.
Figure 15 shows two angles for the 3D plot of
y,
and
. The blue surface is the true simulated response surface, and dots with different colors represent the estimated responses given by the two “wrong” functions proposed by the two methods. In this simulation, the estimates from the GP-SR function lie further from the true surface than those from DSR. The RMSE values between the responses estimated by both approaches and the true simulated responses are small, as shown in
Table 2.
For Simulation 3, when we have
, GP-SR can successfully detect the true mathematical function, that is
, and it failed when
increased to
. The function identified by GP-SR is
which is also shown in
Table 3. However, DSR can only detect the true function when
. For
, it cannot recover the exact function used in the simulation. The functions that DSR proposed under different noise levels are shown in
Table 3.
Both methods successfully detected the term
even with
but failed to detect
.
Figure 16 displays two angles of the 3D plot of
y,
and
. The blue surface represents the true simulated response surface, while the dots with different colors represent the estimated responses obtained by the four “wrong” functions proposed by each method with different levels of noise. From
Figure 16, we can see that the estimations obtained using DSR under
are nearly identical to the true responses (blue surface), even though the exact functional form was not recovered. This indicates that the symbolic form returned by DSR can be highly sensitive to even small amounts of noise in the data, although in this case we may still consider that DSR has detected the “true” surface. Comparing DSR and GP-SR, even though DSR failed to detect the true function at lower noise levels, the results obtained by DSR are closer to the true surface than those from the GP approach when
. In this case, DSR performs better under relatively higher noise levels. The RMSE values for the estimated responses obtained by each method and the true simulated responses are provided in
Table 3.
In Simulation 4 and Simulation 5, as we increase the noise level
from 0 to 0.1, both GP-SR and DSR can successfully detect the true mathematical relationships between inputs and output. The RMSE values for the estimated responses obtained by each method and the true simulated responses are provided in
Table 4 and
Table 5, respectively.
These simulation results demonstrate that both symbolic regression methods are effective in detecting the true mathematical relationships between inputs and outputs when the noise levels are low. However, as the noise levels increase, both methods may fail. In the case of more complex mathematical forms (Simulation 1), even a small amount of noise can cause both methods to fail. However, when dealing with simpler mathematical equations (Simulation 4 and Simulation 5), both methods can perform well even with higher noise levels.
All of the previous conclusions were based on a single trial of simulations for each setting, and we kept the number of observations fixed at 3000. If we decrease the number of observations, will these methods still succeed at the same noise level? If we increase the number of observations, will both methods tolerate a higher noise level? To answer these questions, we tried more combinations of the number of observations n and noise levels for each of the simulation settings.
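Specifically, we first fixed the number of observations n and varied the noise level from 0 to 0.1; then, we fixed the noise level and varied n from 1000 to 5000. For each simulation setting, we ran 100 trials and calculated the return rate, that is, the proportion of trials in which the method recovered the true relationship, for each method.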
Figure 17 presents the return rates for the two approaches across different simulation settings. The left panel displays the return rates for
as
increases from 0 to 0.1, while the right panel shows the return rates for
as
n varies from 1000 to 5000. The numeric results are also summarized in
Table 6 and
Table 7. The findings indicate that across all five simulation setups, the return rates decrease as the noise level increases for both approaches. For the more complex equations (Simulation 1 and Simulation 2), the return rates plummet when
and
, with GP-SR exhibiting particularly poor performance, with return rates approaching zero when
. However, for simpler simulations, such as Simulation 5, both approaches show a slower decrease in return rates. In general, DSR outperforms GP-SR by providing higher return rates at the same noise levels. The only exception is Simulation 3, where GP-SR performs slightly better than DSR when
, which suggests that GP-SR may be more effective in detecting the mathematical term
.
When we fixed the noise level and increased the number of observations n, there was no clear and stable trend in the return rates for the same simulation type. This implies that the performance of both methods is influenced more by the complexity of the “true” relationship and the noise level than by the size of the data, at least over the range of n considered (1000 to 5000).
6. Empirical Study
In the simulation study, we found that both GP-SR and DSR methods perform well when the noise level is limited. However, as the noise level increases, both methods start to struggle in detecting the true relationship between inputs and outputs, even if they produce similar RMSEs. Moreover, both methods may fail with extremely small noise if the relationship between inputs and outputs is overly complex.
In this section, we evaluate the performances of two symbolic regression approaches using daily options data on the S&P 500 index. The dataset comprises 1829 days of call and put options, spanning from 2 January 2015 to 16 April 2022, and covering a diverse range of maturities and moneyness.
Following the rules described in Gao et al. [
13], we began by cleaning the data. First, we considered only options with a maturity greater than 7 days but less than a year. Second, we only considered options with a bid price greater than
dollars. Third, we removed options with a bid price larger than the offer price. Finally, we eliminated options with no implied volatility or negative prices.
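The four cleaning rules can be expressed as a simple filter. The sketch below is illustrative only: the record layout and field names (`days_to_maturity`, `bid`, `offer`, `implied_vol`) are hypothetical, one year is taken as 365 days, and the bid threshold is left as a required argument rather than given a default value.

```python
import math


def clean_options(rows, min_bid):
    """Keep only option records that pass the four cleaning rules.

    rows    : iterable of dicts with the (hypothetical) keys
              'days_to_maturity', 'bid', 'offer', 'implied_vol'
    min_bid : bid-price threshold in dollars (must be supplied by the caller)
    """
    kept = []
    for row in rows:
        iv = row.get("implied_vol")
        if not 7 < row["days_to_maturity"] < 365:   # rule 1: maturity window
            continue
        if not row["bid"] > min_bid:                # rule 2: minimum bid price
            continue
        if row["bid"] > row["offer"]:               # rule 3: crossed quotes
            continue
        if iv is None or math.isnan(iv):            # rule 4a: missing implied volatility
            continue
        if row["bid"] < 0 or row["offer"] < 0:      # rule 4b: negative prices
            continue
        kept.append(row)
    return kept
```

Applying the rules in one pass keeps the order of the surviving records unchanged, which is convenient when the data are later grouped by date.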
Note that the implied volatility surface varies for different times t and types of options (put or call). For most dates, we had more put options data than call options data. We selected four dates with the most put options data, namely, 21, 25, 26, and 27 January 2022, and evaluated the performance of the two symbolic regression approaches in identifying the relationship between IVS, moneyness, and time to maturity using daily data separately.
For the ith option in date t, let us denote
represents time to maturity (days/365);
represents moneyness, . Here, represents the strike price, is the risk-free interest rate at date t, represents time to maturity, and represents the underlying price of the asset.
For each date, we used moneyness and time to maturity as inputs and applied both GP-SR and DSR, together with Bayesian optimization, to find mathematical models that provide a good estimate of the implied volatility surface.
Goncalves et al. [
14] proposed a model that performs very well in estimating IVS. Based on their years of financial experience, they proposed adjusting moneyness
using time to maturity
. Precisely, they used
instead of raw moneyness
in the model. After adjustment, they fit the log-transformed volatility
on a daily basis using a polynomial model as the following:
We used this model as a benchmark and compared the equations obtained through symbolic regressions with the benchmark model, using RMSE and correlations between predicted implied volatility and true implied volatility as metrics of comparison.
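A daily fit of such a benchmark can be sketched with ordinary least squares. The specification below is an assumption: it uses the five-term polynomial in time-adjusted moneyness M and time to maturity τ (intercept, M, M², τ, M·τ) as recalled from the published Gonçalves and Guidolin model, and it should be checked against the benchmark equation before use. The function names are illustrative.

```python
import numpy as np


def design_matrix(m, tau):
    """Columns: 1, M, M^2, tau, M*tau (assumed benchmark terms)."""
    m = np.asarray(m, dtype=float)
    tau = np.asarray(tau, dtype=float)
    return np.column_stack([np.ones_like(m), m, m ** 2, tau, m * tau])


def fit_benchmark(m, tau, iv):
    """Least-squares fit of log implied volatility for a single date."""
    beta, *_ = np.linalg.lstsq(design_matrix(m, tau), np.log(iv), rcond=None)
    return beta


def predict_benchmark(beta, m, tau):
    """Map the fitted log-volatility back to the volatility scale."""
    return np.exp(design_matrix(m, tau) @ beta)
```

Because the model is linear in its coefficients after the log transform, re-estimating it for every date is inexpensive, which is what makes it a practical daily benchmark.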
On the date of 21 January 2022, there were 3227 put options available. Using BayesOpt to tune the parameters for both GP-SR and DSR, we obtained the following equations:
After these two equations were detected, we calculated the estimated implied volatility
for each option and computed the correlation
and RMSE for
. The results are shown in
Table 8. From
Table 8, we can see that the benchmark model provides the highest correlation and lowest RMSE. Between GP-SR and DSR, GP-SR performs better.
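The two comparison metrics can be computed as in the short sketch below. It assumes that the correlation is the Pearson correlation between observed and estimated implied volatility; the function names are illustrative.

```python
import numpy as np


def rmse(observed, estimated):
    """Root mean squared error between observed and estimated values."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.sqrt(np.mean((estimated - observed) ** 2)))


def correlation(observed, estimated):
    """Pearson correlation between observed and estimated values."""
    return float(np.corrcoef(observed, estimated)[0, 1])
```

RMSE is measured in the units of implied volatility, while the correlation is scale-free, so a model can rank well on one metric and poorly on the other if its estimates are systematically shifted.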
Figure 18 presents four angles for the 3D plot of IVS, moneyness
k, and time to maturity
. The blue dots represent the true observed implied volatility, and the surfaces with different colors represent the estimated implied volatility surfaces using different methods. From
Figure 18, we can see that when we fix
, the observed implied volatility generally increases as moneyness increases (third angle). However, when we fix the moneyness at different values, we can see obvious curves with smile shapes that move in different directions as time to maturity increases (fourth angle). All three methods can successfully capture these differences in directions.
For the date of 25 January 2022, there are 1292 put options available, and the equations detected by GP-SR and DSR are as follows. The numeric results are shown in
Table 8.
Upon analyzing the results for the date 25 January 2022, it can be observed that the benchmark model outperforms GP-SR and DSR, with the latter two methods exhibiting comparable performance but slightly worse than the benchmark. Notably, the equation obtained by DSR is overly complex, while the one detected by GP-SR is significantly simpler but yields comparable results.
Figure 19 depicts the 3D plot from four angles, where the blue dots indicate the true observed implied volatility and the surfaces with different colors represent the estimated implied volatility surfaces using the three different methods. We observed patterns similar to those from the previous date of 21 January 2022. However, we note that only the benchmark model partially captures the different directions of the smile curves when we fix the moneyness at different values (fourth angle).
For the date 26 January 2022, there are 1407 put options available. The equations detected by GP-SR and DSR are as follows, and the numeric results are shown in
Table 8:
On this date, both symbolic regression methods produce comparably complex equations, but they outperform the benchmark model. Between GP-SR and DSR, DSR performs better.
Figure 20 shows that for this date, there are no clear differences in directions for the smile curves when we fix the moneyness at different values, and both symbolic regression methods perform better than before.
For the final date, 27 January 2022, there are 1308 put options available. The equations detected by GP-SR and DSR are presented below, and the corresponding numeric results are shown in
Table 8.
For this date, both symbolic regression methods outperform the benchmark model. While DSR performs slightly better than GP-SR, its equation is overly complicated. In contrast, GP-SR performs nearly as well with a simpler equation.
Figure 21 illustrates that when fixing moneyness at different values, we observe slightly different directions for the smile curves. Nonetheless, both symbolic regression methods still outperform the benchmark.
The results from the four single days show that no single method consistently performs best. DSR sometimes provides the highest correlations and lowest RMSEs, but it tends to produce overly complicated mathematical equations. GP-SR can sometimes provide parsimonious models with equally good performance, and at other times the benchmark model works best.
Thus far, we have only analyzed daily put options data separately. However, Goncalves et al. [
14] applied their model on a daily basis by re-estimating coefficients and the IVS using the same model structure for each date. The previous exploration based on the four dates provides insight into the functional forms suggested by GP-SR and DSR. Based on some of the equations discovered for those four dates, we fit the following four non-linear models to the implied volatility data separately for each of the 1829 dates.
For each date t, we fit the following four non-linear models on put options separately. Here, i indexes the options on date t:
Model 1 (based on the GP-SR equation for 21 January 2022):
Model 2 (based on the GP-SR equation for 25 January 2022):
Model 3 (based on the GP-SR equation for 26 January 2022):
Model 4 (based on the GP-SR equation for 27 January 2022):
We applied the four non-linear models and the benchmark model to daily implied volatility data for all 1829 dates. We calculated the RMSE and correlation between predicted and observed implied volatility for each date using each method. The results are summarized in
Table 9 and are shown in
Figure 22, where box plots of the 1829 daily RMSE and correlation values for all five models are presented.
The box plots and summary tables show that Model 4 outperforms the benchmark, providing smaller RMSEs and higher correlations between estimated and observed implied volatility. However, the other three models perform worse than the benchmark. Additionally, we analyzed the stability of the parameter estimates for Model 4 and found that they are generally consistent across all 1829 dates, as summarized in
Table 10.
In addition to this better fit, and unlike the linear model addressed in Dumas, Fleming, and Whaley [
2] and the non-linear GARCH(1,1) model proposed by Heston and Nandi [
3] in which coefficient estimates are highly unstable, Model 4 provides stable coefficient estimates across all 1829 dates of put options data. However, the major disadvantage of Model 4 is that the form of the model may be too complicated for economists to explain. Nevertheless, this result indicates that symbolic regression can potentially propose meaningful mathematical relationships between IVS, moneyness, and time to maturity, even without a deep understanding of financial theories. It may also provide insights into which features are useful for modeling the IVS. For instance, Model 2 is a linear regression on two simple transformations, namely
and
. Although this model does not perform as well as the benchmark, it still performs reasonably well. This could be evidence that
and
may be good features for predicting the implied volatility surface.