1. Introduction
Automated trading systems have become increasingly popular due to their ability to make faster, more accurate, and more reliable decisions than human traders. These systems are particularly effective in highly dynamic and volatile financial markets, where human decisions often fall short due to emotional bias or delayed reactions. Among the many advancements in automated trading, Deep Reinforcement Learning (DRL) stands out as a powerful tool. DRL enables systems to learn from large datasets and make decisions without relying on fixed market assumptions, making it highly adaptable to changing market conditions [1,2].
Despite its promise, prior research has predominantly focused on standalone DRL techniques or traditional ensemble methods in financial markets [3,4,5]. While these approaches have demonstrated potential, they often fall short in adapting to rapidly evolving market dynamics. The integration of DRL with advanced ensemble frameworks, such as the Iterative Model Combining Algorithm (IMCA), remains an underexplored area. Addressing this gap, this study introduces a novel hybrid framework that synergistically combines the adaptability of DRL with the dynamic optimization capabilities of IMCA, thereby enhancing the robustness and efficiency of portfolio management strategies.
The proposed framework incorporates IMCA as a dynamic ensemble technique that iteratively adjusts model weights to minimize forecasting errors and adapt to shifting market conditions. Unlike traditional ensemble methods, which employ static weighting or simplistic averaging mechanisms, IMCA leverages recent model performance to recalibrate its weight distribution in real time [6,7]. This ensures that the combined strategy remains responsive to sudden market fluctuations, such as those experienced during the COVID-19 pandemic [8,9]. By dynamically harnessing the strengths of individual DRL algorithms, each excelling in specific market conditions, and compensating for their weaknesses, IMCA enables the creation of a resilient and adaptive portfolio management system.
Moreover, the integration of DRL with IMCA is particularly advantageous in emerging markets like Thailand, where market behavior is often characterized by high volatility and structural inefficiencies [10,11]. DRL’s capacity to learn complex, nonlinear relationships complements IMCA’s ability to dynamically adapt to real-time performance metrics. This synergy not only enhances portfolio returns but also improves risk mitigation, as the hybrid framework can effectively respond to unexpected shocks and systemic risks [1,2]. By bridging the gap between standalone DRL methods and static ensemble approaches, this research contributes a significant innovation to the field of automated trading and portfolio optimization.
Emerging markets, such as Thailand’s SET50 Index, present unique challenges and opportunities for testing advanced portfolio strategies. Characterized by higher inefficiencies, external dependencies, and behavioral biases compared to developed markets, the SET50 Index serves as an ideal testbed for adaptive trading systems. Its high volatility during crises like COVID-19 further underscores the importance of robust strategies capable of navigating turbulent market conditions. Moreover, studying the SET50 provides valuable insights into the application of DRL and IMCA in markets with similar dynamics across Southeast Asia and other emerging economies.
This research aims to optimize portfolio performance for SET50 stocks by combining DRL techniques with IMCA. The DRL algorithms utilized include Advantage Actor–Critic (A2C), Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), Soft Actor–Critic (SAC), and Twin Delayed Deep Deterministic Policy Gradient (TD3). These algorithms were chosen due to their proven success in financial applications [4,12]. The dataset consists of daily stock data from the SET50 Index spanning 2008 to 2023, divided into a training period (2008–2018) characterized by stable market conditions and a testing period (2018–2023), which includes the highly volatile COVID-19 era. The results demonstrate that the combined DRL–IMCA approach significantly outperforms traditional strategies, such as the Min-Variance strategy, in both returns and risk management.
This research makes several key contributions to the field of adaptive portfolio management. By integrating DRL and IMCA, it offers a novel approach to creating trading strategies that are not only profitable but also resilient to market fluctuations. The findings are particularly valuable for institutional investors, such as pension funds and mutual funds, who require stable yet dynamic strategies to manage risk in volatile markets. Retail investors can also benefit from this framework by gaining access to advanced, automated techniques that enhance portfolio performance with minimal manual intervention. Furthermore, businesses such as securities firms and FinTech startups can leverage these insights to develop competitive trading systems that are robust across market cycles.
This study builds on prior work in DRL and ensemble methods, extending the literature by addressing their integration within emerging market contexts. By evaluating the hybrid DRL–IMCA framework, this research highlights the potential for adaptive strategies to outperform traditional approaches in volatile environments.
The structure of this paper is as follows: Section 2 reviews relevant literature, Section 3 outlines the methodology, Section 4 presents the results, and Section 5 concludes with key insights and future directions.
3. Methodology
3.1. Reinforcement Learning Algorithms and Experimental Setup
This study utilizes five state-of-the-art DRL algorithms: A2C, PPO, DDPG, TD3, and SAC. These algorithms are chosen for their distinct capabilities in addressing the challenges of portfolio management in volatile markets, such as balancing risk and reward, handling high-dimensional data, and adapting to rapid market changes. The training of these models is conducted using a dataset of daily stock prices from the SET50 Index, spanning from 2008 to 2023. Before implementing the DRL algorithms, the dataset was preprocessed to ensure that it was clean and consistent.
In addition to stock prices, we consolidate text-based data to measure the market sentiment toward each SET50 company on specific dates. News sentiment is sourced from a corpus of articles collected via Google News, a reliable aggregator. Additionally, we utilize Twitter, widely recognized for its effectiveness in forecasting stock prices and market movements [38], to derive social media-based sentiment.
For Google News, each article headline or body text in the collected corpus is processed using VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and rule-based sentiment analysis tool designed to assess the polarity (positive, negative, or neutral) and intensity of sentiment. VADER generates a compound sentiment score for each piece of text, ranging from −1 (most negative) to +1 (most positive). These scores are then aggregated to calculate an overall sentiment score for each company based on the news coverage on specific dates, capturing the market sentiment reflected in the news.
For Twitter, sentiment analysis begins with gathering tweets related to SET50 companies. Tweets are processed using VADER, which assigns a compound sentiment score to each tweet based on its text content. To enhance the accuracy and relevance of the sentiment measurement, engagement metrics such as likes and retweets are incorporated. Tweets with higher engagement are given greater weight in the calculation of the overall Twitter sentiment score for each company. This approach ensures that tweets with significant market impact contribute more to the sentiment analysis, providing a robust measure of social media-based sentiment for specific dates.
To combine the sentiment scores from Google News and Twitter, a unified market sentiment score is calculated using a weighted-average approach. First, the sentiment scores from both sources are standardized so that they lie on the same scale, from −1 (most negative) to +1 (most positive). Next, each source is assigned a weight reflecting its relevance and reliability; in this study, the two sources are weighted equally. Finally, the overall market sentiment score is calculated as the weighted average of the two scores.
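The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the function names, the engagement weight `1 + likes + retweets`, and the use of clipping as the standardization step are all assumptions made for the example. The compound scores are taken as given inputs (for instance, as produced by VADER).

```python
from statistics import mean

def engagement_weighted_sentiment(tweets):
    """Average compound scores, weighting each tweet by its engagement.

    Assumption: weight = 1 + likes + retweets, so a tweet with no
    engagement still counts once. The source does not give the exact rule.
    """
    weights = [1 + t["likes"] + t["retweets"] for t in tweets]
    weighted_sum = sum(w * t["compound"] for w, t in zip(weights, tweets))
    return weighted_sum / sum(weights)

def clip_unit(x):
    """Keep a score inside [-1, 1] (a simple stand-in for standardization)."""
    return max(-1.0, min(1.0, x))

def combined_sentiment(news_scores, tweets, w_news=0.5, w_twitter=0.5):
    """Weighted average of the news score and the Twitter score (equal weights by default)."""
    news = clip_unit(mean(news_scores))
    twitter = clip_unit(engagement_weighted_sentiment(tweets))
    return w_news * news + w_twitter * twitter

# Hypothetical inputs for one company on one date.
news_scores = [0.4, 0.2]
tweets = [
    {"compound": 0.8, "likes": 9, "retweets": 0},
    {"compound": -0.2, "likes": 0, "retweets": 0},
]
score = combined_sentiment(news_scores, tweets)
```

With these inputs the highly engaged positive tweet dominates the Twitter score, so the combined score is pulled above the plain news average.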
Each model is trained with the following configuration:
Training Data: Daily adjusted closing prices and engineered features, including Moving Average Convergence Divergence (MACD), Relative Strength Index (RSI), and Simple Moving Averages (SMA);
Training Period: Data from 2008 to 2018 were used for training, and 2018 to 2023 for testing;
Learning Rate: 0.0003 for most models, with fine-tuning based on validation performance;
Episodes: 1000 episodes for stability and convergence;
Batch Size: 64 observations per batch;
Discount Factor ($\gamma$): 0.99 for all models to prioritize long-term rewards;
Exploration Rate: Initial exploration rate of 1.0, decayed over episodes for models using epsilon-greedy policies;
Optimization Method: Grid search was employed for hyperparameter tuning, including discount factors, learning rates, and batch sizes;
Computational Resources: Training was conducted on an NVIDIA RTX 3090 GPU with 24 GB memory for efficient parallel processing;
Framework: TensorFlow and PyTorch were utilized for implementing the algorithms;
Optimization: Adam optimizer was employed across all models.
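The engineered features named in the configuration above (SMA, RSI, and MACD) can be computed from a price series as in the following sketch. The window lengths (14 for RSI, 12 and 26 for MACD) are common conventions rather than values stated in this study, the RSI variant uses simple averages of gains and losses, and all function names are illustrative.

```python
def sma(prices, window):
    """Simple moving average; None until a full window is available."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out

def ema(prices, span):
    """Exponential moving average with smoothing 2 / (span + 1), seeded with the first price."""
    alpha = 2.0 / (span + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out

def macd_line(prices, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA (signal line omitted for brevity)."""
    return [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]

def rsi(prices, window=14):
    """RSI over the last `window` price changes, using simple averages."""
    if len(prices) < window + 1:
        raise ValueError("need at least window + 1 prices")
    recent = prices[-(window + 1):]
    changes = [b - a for a, b in zip(recent, recent[1:])]
    avg_gain = sum(c for c in changes if c > 0) / window
    avg_loss = sum(-c for c in changes if c < 0) / window
    if avg_loss == 0:
        return 100.0  # no losses in the window
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

In practice these features would be computed per stock and appended to the state vector alongside the adjusted closing price.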
To further enhance performance, transfer learning techniques are applied where pre-trained weights from models trained on global indices (e.g., S&P 500) are fine-tuned for the SET50 dataset. This approach leverages the models’ ability to generalize from diverse datasets, speeding up convergence and improving robustness.
3.1.1. Experimental Workflow
The experiments are conducted on a MacBook Pro (2020) with a 2 GHz Quad-Core Intel Core i5 processor and 16 GB of memory (3733 MHz LPDDR4X). The workflow involves downloading data from Yahoo Finance, which takes approximately 2 min, followed by the addition of technical indicators, which takes about 3 min. Training each model (100,000 timesteps) requires varying durations depending on the complexity of the model. Training times range from 4 to 70 min, with specific hyperparameters for each model summarized in Table 1.
3.1.2. Advantage Actor–Critic (A2C)
The Advantage Actor–Critic algorithm is chosen for its ability to balance exploration and exploitation in discrete-time environments, making it particularly well-suited for stock trading applications. To enhance its performance, sentiment analysis is integrated into the model, enabling it to incorporate qualitative insights from financial news and social media alongside traditional quantitative metrics. The model is designed to prioritize risk-adjusted returns by employing a sentiment-adjusted reward function, which maximizes cumulative returns while penalizing excessive drawdowns and unfavorable sentiment exposure.
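The loss function for A2C, which optimizes both the policy and value networks, is expressed as follows:
$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \log \pi_\theta(a_i \mid s_i, m_i)\, A(s_i, m_i, a_i) + c \cdot \frac{1}{N}\sum_{i=1}^{N} \left(V(s_i, m_i) - R_i\right)^2$$
where $N$ represents the total number of training samples, $\pi_\theta(a_i \mid s_i, m_i)$ denotes the probability of selecting action $a_i$ given the state $s_i$ and sentiment $m_i$ under the policy parameterized by $\theta$, and $A(s_i, m_i, a_i)$ quantifies the advantage of taking action $a_i$ over the baseline policy.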
The second term in the loss function incorporates a regularization constant $c$, which balances the policy loss and value function loss, ensuring that the optimization remains stable. The value function $V(s_i, m_i)$ represents the estimated value of state $s_i$ with sentiment $m_i$, and $R_i$ denotes the observed reward, which is adjusted to account for sentiment data. Specifically, $R_i$ penalizes exposure to assets with negative sentiment while incentivizing investments in assets with positive sentiment. The sentiment adjustment in $R_i$ is computed as
$$R_i = r_i - \lambda_1 \mathrm{TC}_i - \lambda_2 \mathrm{SP}_i$$
where $r_i$ is the unadjusted portfolio return, $\mathrm{TC}_i$ and $\mathrm{SP}_i$ are the transaction cost and sentiment penalty terms, and $\lambda_1$ and $\lambda_2$ are weighting factors that balance the impact of transaction costs and sentiment penalties. The sentiment score is derived from financial news and social media data, using Natural Language Processing (NLP) techniques to quantify the market’s perception of individual assets. A negative sentiment score penalizes the agent for holding assets perceived negatively by the market, thereby reducing risk exposure.
By incorporating sentiment into both the state representation and the reward function, the A2C model aligns its decision-making with both quantitative metrics and qualitative market insights. This integration enables the model to dynamically respond to shifts in market sentiment, fostering an adaptive trading strategy that improves robustness and performance in complex and volatile financial environments.
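To make the A2C objective concrete, the sketch below evaluates a policy-gradient term plus a weighted value-regression term for a small batch. It is an illustration under stated assumptions: the form of the sentiment-adjusted reward (raw return minus weighted transaction cost and sentiment penalty) is one plausible reading of the description above, and the function and parameter names (`lam1`, `lam2`, `c`) are hypothetical.

```python
import math

def sentiment_adjusted_reward(portfolio_return, transaction_cost,
                              sentiment_penalty, lam1, lam2):
    """Assumed reward form: raw return minus weighted cost and sentiment penalty."""
    return portfolio_return - lam1 * transaction_cost - lam2 * sentiment_penalty

def a2c_loss(log_probs, advantages, values, rewards, c):
    """Policy loss (negative advantage-weighted log-probability) plus c times value loss."""
    n = len(log_probs)
    policy_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    value_loss = sum((v - r) ** 2 for v, r in zip(values, rewards)) / n
    return policy_loss + c * value_loss

# Two hypothetical samples: action probabilities 0.5 and 0.25.
loss = a2c_loss(
    log_probs=[math.log(0.5), math.log(0.25)],
    advantages=[1.0, -1.0],
    values=[1.0, 2.0],
    rewards=[0.0, 2.0],
    c=0.5,
)
```

In a real training loop the log-probabilities and values would come from the actor and critic networks, and the gradient of this scalar would be taken by an automatic differentiation framework.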
3.1.3. Proximal Policy Optimization (PPO)
Proximal Policy Optimization is a state-of-the-art reinforcement learning algorithm recognized for its robust performance in volatile environments. It is specifically designed to balance exploration and exploitation while ensuring stable training, making it particularly well-suited for portfolio optimization tasks. PPO utilizes a clipping mechanism to limit excessively large policy updates, ensuring stable and incremental improvements over time. Additionally, entropy regularization is incorporated to encourage exploration, preventing the agent from prematurely converging to suboptimal policies.
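The PPO loss function is defined as
$$L^{\mathrm{PPO}}(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[\min\left(r_i(\theta) A_i,\ \operatorname{clip}\left(r_i(\theta), 1-\epsilon, 1+\epsilon\right) A_i\right) + \beta\, H\left(\pi_\theta(\cdot \mid s_i)\right)\right]$$
where the expectation operator $\mathbb{E}$ has been replaced with a summation over $i$, consistent with batch-wise training. The term $r_i(\theta)$ represents the probability ratio of the updated policy to the old policy, which is given by $r_i(\theta) = \pi_\theta(a_i \mid s_i) / \pi_{\theta_{\mathrm{old}}}(a_i \mid s_i)$. The advantage function, $A_i$, quantifies the improvement of the chosen action over the baseline. The clipping mechanism is controlled by the threshold parameter $\epsilon$, typically set to 0.2, which restricts updates to a predefined trust region, avoiding destabilizing policy changes. The entropy of the policy, denoted as $H(\pi_\theta(\cdot \mid s_i))$, promotes exploration by encouraging randomness in action selection, while $\beta$ serves as a regularization coefficient to balance exploration and exploitation.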
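The clipping behaviour is easiest to see numerically. The following sketch computes a clipped surrogate loss with an entropy bonus for a batch, written as a quantity to minimize. The clipping threshold of 0.2 follows the text; the default entropy coefficient `beta=0.01` and the function name are assumptions for illustration only.

```python
import math

def ppo_loss(new_log_probs, old_log_probs, advantages, entropies,
             eps=0.2, beta=0.01):
    """Negative clipped surrogate objective with an entropy bonus, averaged over the batch."""
    total = 0.0
    for new_lp, old_lp, adv, ent in zip(new_log_probs, old_log_probs,
                                        advantages, entropies):
        ratio = math.exp(new_lp - old_lp)              # pi_new / pi_old
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv) + beta * ent
    return -total / len(advantages)

# Sample 1: ratio 1.5 with positive advantage, gain is capped at 1.2.
# Sample 2: ratio 0.5 with negative advantage, the clipped value -0.8 is used.
loss = ppo_loss(
    new_log_probs=[math.log(1.5), math.log(0.5)],
    old_log_probs=[0.0, 0.0],
    advantages=[1.0, -1.0],
    entropies=[0.0, 0.0],
)
```

Taking the minimum of the clipped and unclipped terms makes the objective pessimistic: the policy gains nothing from moving the ratio beyond the trust region in the favorable direction.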
3.1.4. Deep Deterministic Policy Gradient (DDPG)
The DDPG architecture is based on an actor–critic framework with two neural networks: the actor network determines the optimal actions (portfolio weights), while the critic network evaluates the quality of these actions. Each network comprises two hidden layers, each containing 256 neurons. The loss function for DDPG, which optimizes the critic network, is defined as
$$L(\theta^{Q}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - Q(s_i, m_i, a_i; \theta^{Q})\right)^2$$
where $y_i$ represents the target Q-value, which estimates the expected cumulative reward based on observed outcomes, and $Q(s_i, m_i, a_i; \theta^{Q})$ provides the current Q-value estimate for a given state $s_i$, sentiment $m_i$, and action $a_i$. The parameters of the Q-value network are denoted by $\theta^{Q}$.
The target Q-value $y_i$ is computed as
$$y_i = R_i + \gamma\, Q'(s_{i+1}, m_{i+1}, a_{i+1}; \theta^{Q'})$$
where $R_i$ is the sentiment-adjusted reward, $\gamma$ is the discount factor that determines the weight of future rewards, and $Q'(s_{i+1}, m_{i+1}, a_{i+1}; \theta^{Q'})$ is the Q-value of the next state-action pair estimated by the target Q-network.
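As a small numerical illustration of the critic update described above, the sketch below builds one-step bootstrapped targets and evaluates the mean-squared error against current Q-value estimates. The function names are illustrative, and terminal-state handling is omitted because the text does not describe it.

```python
def td_targets(rewards, next_q_values, gamma=0.99):
    """One-step targets: reward plus discounted target-network Q-value of the next pair."""
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]

def critic_loss(targets, q_values):
    """Mean-squared error between targets and current Q-value estimates."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)

# Hypothetical batch of two transitions, with gamma = 0.5 to keep the arithmetic simple.
targets = td_targets(rewards=[1.0, 0.0], next_q_values=[10.0, 5.0], gamma=0.5)
loss = critic_loss(targets, q_values=[5.0, 2.5])
```

Here the first estimate is off by one and the second is exact, so only the first transition contributes to the loss.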
3.1.5. Soft Actor–Critic (SAC)
The Soft Actor–Critic (SAC) algorithm is based on an actor–critic framework that employs a stochastic policy and entropy maximization to enhance exploration. The actor network determines the optimal stochastic actions (portfolio weights), while the critic network evaluates the quality of these actions. SAC introduces entropy into the objective function, encouraging exploration and preventing premature convergence to suboptimal policies. The entropy temperature parameter, denoted by $\alpha$, is automatically tuned during training to achieve an optimal balance between exploration and exploitation.
The critic network is trained by minimizing the following loss function:
$$L_Q(\phi) = \frac{1}{N}\sum_{i=1}^{N}\left(Q_\phi(s_i, m_i, a_i) - y_i\right)^2$$
where $Q_\phi(s_i, m_i, a_i)$ represents the Q-value estimate for a given state $s_i$, sentiment $m_i$, and action $a_i$, with $\phi$ denoting the parameters of the critic network. The target Q-value $y_i$ is computed as
$$y_i = R_i + \gamma\left(Q_\phi(s_{i+1}, m_{i+1}, a_{i+1}) - \alpha \log \pi_\theta(a_{i+1} \mid s_{i+1}, m_{i+1})\right), \quad a_{i+1} \sim \pi_\theta(\cdot \mid s_{i+1}, m_{i+1})$$
where $R_i$ is the sentiment-adjusted reward, $\gamma$ is the discount factor, $\alpha$ is the entropy temperature, and $\pi_\theta$ represents the stochastic policy output by the actor network.
The actor network is trained by minimizing the following loss function:
$$L_\pi(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(\alpha \log \pi_\theta(a_i \mid s_i, m_i) - Q_\phi(s_i, m_i, a_i)\right)$$
where $\theta$ denotes the parameters of the actor network, and the term $\alpha \log \pi_\theta(a_i \mid s_i, m_i)$ encourages exploration by maximizing the entropy of the policy.
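The role of the entropy temperature can be illustrated with a short sketch of the soft target and the actor loss. This is a simplified reading that uses a single critic and treats the temperature as a fixed input, whereas the study tunes it automatically. The function names are illustrative.

```python
def sac_target(reward, next_q, next_log_prob, gamma, alpha):
    """Soft target: reward plus discounted (next Q-value minus alpha * log-probability)."""
    return reward + gamma * (next_q - alpha * next_log_prob)

def sac_actor_loss(log_probs, q_values, alpha):
    """Actor loss: mean of alpha * log-probability minus Q-value over the batch."""
    return sum(alpha * lp - q for lp, q in zip(log_probs, q_values)) / len(log_probs)

# A low-probability next action (log-probability -1.0) raises the target via the entropy bonus.
y = sac_target(reward=1.0, next_q=2.0, next_log_prob=-1.0, gamma=0.5, alpha=0.5)
actor_loss = sac_actor_loss(log_probs=[-1.0, -2.0], q_values=[1.0, 3.0], alpha=0.5)
```

Because the log-probability enters with a negative sign in the target and a positive sign in the actor loss, less likely actions are rewarded, which is how the entropy term sustains exploration.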
3.1.6. Twin Delayed Deep Deterministic Policy Gradient (TD3)
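The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm enhances the DDPG by addressing overestimation bias, improving stability, and incorporating factors such as transaction costs, which are crucial for financial applications. TD3 begins with the initialization of the actor network $\pi_\phi$, which outputs deterministic actions $a = \pi_\phi(s)$ for a given state $s$, and two critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$, which estimate the state-action values. The target Q-value is computed as:
$$y = r - \lambda\, \mathrm{TC}(s, a) + \gamma \min_{j=1,2} Q_{\theta_j'}(s', \tilde{a})$$
where $r$ is the reward, $\mathrm{TC}(s, a)$ is the transaction cost incurred by the action, $\lambda$ is a weighting factor that penalizes transaction costs, $\gamma$ is the discount factor, $s'$ is the next state, and $\tilde{a}$ is the smoothed action computed by adding clipped Gaussian noise to the target policy’s action. Target networks $\pi_{\phi'}$, $Q_{\theta_1'}$, and $Q_{\theta_2'}$ are updated using a soft update rule:
$$\theta_j' \leftarrow \tau \theta_j + (1 - \tau)\theta_j', \qquad \phi' \leftarrow \tau \phi + (1 - \tau)\phi'$$
where $\tau$ is the soft update rate.
To train the model, experiences $(s, a, r, s')$ are collected by executing actions $a$ with exploration noise and storing them in the replay buffer $\mathcal{D}$. A mini-batch of transitions $B$ is sampled from $\mathcal{D}$ to update the critics by minimizing the mean-squared error:
$$L(\theta_j) = \frac{1}{|B|}\sum_{(s, a, r, s') \in B}\left(y - Q_{\theta_j}(s, a)\right)^2$$
where $y$ is the target Q-value, and $j \in \{1, 2\}$. The actor network is updated less frequently (e.g., once every two critic updates) by maximizing the Q-value estimated by the first critic, incorporating the effect of transaction costs:
$$J(\phi) = \frac{1}{|B|}\sum_{s \in B}\left[Q_{\theta_1}(s, \pi_\phi(s)) - \lambda\, \mathrm{TC}(s, \pi_\phi(s))\right]$$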
TD3 incorporates twin critics to mitigate overestimation bias, target policy smoothing to prevent exploitation of sharp Q-value variations, and delayed policy updates to enhance stability. These features are further augmented by the integration of transaction cost penalties, making TD3 particularly well-suited for continuous control tasks, such as financial applications. By accounting for market dynamics and trading costs, TD3 provides a practical and robust solution for complex, real-world scenarios.
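The three TD3 mechanisms summarized above can be sketched with scalar values. This is an illustration under assumptions: the transaction cost is subtracted directly inside the target, which is one possible reading of the description, and the function names, noise settings, and action bounds are hypothetical.

```python
import random

def smoothed_target_action(target_action, noise_std, noise_clip, low, high, rng=random):
    """Target policy smoothing: add clipped Gaussian noise, then clip to the action bounds."""
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    return max(low, min(high, target_action + noise))

def td3_target(reward, transaction_cost, q1_next, q2_next, lam, gamma=0.99):
    """Twin-critic target: the smaller of the two target Q-values limits overestimation."""
    return reward - lam * transaction_cost + gamma * min(q1_next, q2_next)

def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, online_params)]

rng = random.Random(0)
a_tilde = smoothed_target_action(0.5, noise_std=0.2, noise_clip=0.1,
                                 low=-1.0, high=1.0, rng=rng)
y = td3_target(reward=1.0, transaction_cost=2.0, q1_next=4.0, q2_next=3.0,
               lam=0.5, gamma=0.5)
new_target = soft_update(target_params=[0.0, 1.0], online_params=[1.0, 1.0], tau=0.25)
```

The delayed policy update is a scheduling choice rather than a formula: in a training loop, the actor and the target networks would be updated only on every second critic step.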
3.2. Iterative Model Combining Algorithm (IMCA)
IMCA is an advanced ensemble technique designed to dynamically adjust the weights of individual models for optimal portfolio management. Unlike traditional methods that rely on static weights, IMCA continuously recalibrates the contributions of each model based on recent performance. This approach is particularly effective for emerging markets like the SET50 Index, which are characterized by high inefficiencies, external dependencies, and significant volatility.
Emerging markets often experience unpredictable price movements due to external shocks, lower liquidity, and behavioral biases. These factors make static models inadequate, as they fail to adapt to rapid market changes. IMCA addresses these challenges by dynamically reallocating weights, emphasizing models that perform well in current market conditions while reducing the impact of underperforming models.
3.2.1. Steps in IMCA
The IMCA framework follows a structured process designed to optimize predictions through iterative adjustments. This methodology ensures that the model ensemble adapts to changing market conditions while maintaining high predictive accuracy.
The first step involves selecting an error metric ($E$), such as Root Mean Square Error (RMSE) or Mean Absolute Error (MAE), to evaluate the performance of individual models. The error metric quantifies prediction accuracy relative to observed outcomes. The general form of $E$ is given as
$$E = \left(\frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|^{p}\right)^{1/p}$$
where $N$ is the number of observations, $\hat{y}_i$ represents the prediction for observation $i$, $y_i$ denotes the actual observed value, and $p$ defines the type of error metric (e.g., $p = 2$ for RMSE and $p = 1$ for MAE). This metric serves as the foundation for updating model weights.
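A generalized error of this kind takes only a few lines to compute. The sketch below assumes the power-mean form, with `p=1` giving MAE and `p=2` giving RMSE; the optional `window` argument, which restricts the calculation to the most recent observations, and the function name are illustrative.

```python
def error_metric(predictions, actuals, p=2, window=None):
    """Power-mean error: p=1 gives MAE, p=2 gives RMSE.

    If `window` is given, only the last `window` observations are used.
    """
    if window is not None:
        predictions, actuals = predictions[-window:], actuals[-window:]
    n = len(predictions)
    total = sum(abs(f - y) ** p for f, y in zip(predictions, actuals))
    return (total / n) ** (1.0 / p)

preds = [1.0, 2.0, 3.0]
actual = [1.0, 4.0, 3.0]
mae = error_metric(preds, actual, p=1)
rmse = error_metric(preds, actual, p=2)
```

For the same data, RMSE is never smaller than MAE, because squaring places more weight on the single large miss.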
To avoid overfitting, a regularization parameter ($\lambda$) is introduced. Regularization ensures stability by discouraging disproportionately large weights for any single model, especially during short-term fluctuations. The weight adjustment equation incorporating $\lambda$ is expressed as
$$w_k^{(t+1)} = w_k^{(t)} - \lambda\, \nabla_{w_k^{(t)}} E$$
where $w_k^{(t)}$ represents the weight of model $k$ at iteration $t$, and $\nabla_{w_k^{(t)}} E$ is the gradient of the error metric with respect to $w_k^{(t)}$.
The historical data length ($l$) determines the evaluation window size, ensuring that weight updates reflect recent trends while avoiding overreaction to noise. For example, if $l = 10$, the last 10 observations are used to compute the error metric for weight adjustments.
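Model weights are iteratively updated based on their relative performance. A performance score for each model ($S_k$) is defined as
$$S_k = \frac{E_k}{w_k + \epsilon}$$
where $E_k$ is the error metric of model $k$ over the evaluation window, $w_k$ is the current weight of model $k$, and $\epsilon$ is a small constant to prevent division by zero. Underperforming models are penalized with higher $S_k$ values, while better-performing models receive lower $S_k$ values. The refined weight update formula becomes
$$w_k^{(t+1)} = \frac{1 / S_k}{\sum_{j=1}^{n} 1 / S_j}$$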
Finally, the ensemble’s prediction is generated as a weighted sum of individual model predictions:
$$\hat{y}_{t+1} = \sum_{k=1}^{n} w_k^{(t+1)}\, \hat{y}_{k,t+1}$$
where $\hat{y}_{t+1}$ is the ensemble forecast for the next time step ($t + 1$), $w_k^{(t+1)}$ is the updated weight of model $k$, $\hat{y}_{k,t+1}$ is the prediction by model $k$ for the next time step, and $n$ is the total number of models in the ensemble.
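The steps above can be sketched as a single reweighting pass followed by a weighted forecast. The exact update used in the study is not reproduced here. The code assumes one plausible reading, in which each model's score is its windowed error divided by its current weight and the new weights are proportional to the inverse scores. The function names and the `eps` default are illustrative.

```python
def imca_update(weights, errors, eps=1e-8):
    """One reweighting pass under the assumed rule.

    score_k = error_k / (w_k + eps); new weights are proportional to 1 / score_k
    and are normalized to sum to one.
    """
    scores = [e / (w + eps) for w, e in zip(weights, errors)]
    inverse = [1.0 / (s + eps) for s in scores]
    total = sum(inverse)
    return [v / total for v in inverse]

def ensemble_forecast(weights, predictions):
    """Weighted sum of the individual model predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

# Two hypothetical models with equal starting weights; model 1 has one third the error.
new_weights = imca_update(weights=[0.5, 0.5], errors=[1.0, 3.0])
forecast = ensemble_forecast([0.75, 0.25], [100.0, 104.0])
```

Under this rule, repeating the pass with unchanged errors keeps shifting weight toward the more accurate model, which is why a regularization term or a short evaluation window is useful to stop a single model from dominating.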
3.2.2. Example to Illustrate IMCA
To illustrate the IMCA methodology, consider an example where five models ($k = 1, \ldots, 5$) predict stock prices, and their weights are adjusted iteratively based on their performance. Suppose that the historical data length ($l$) is 10 days and the error metric used is Mean Absolute Error (MAE, $p = 1$). Predictions and actual values from 1 January to 10 January 2023 are used to evaluate performance. The combined forecast will be made for 11 January 2023 ($t + 1$) (see Figure 1).
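First, the MAE for each model is calculated using
$$\mathrm{MAE}_k = \frac{1}{10}\sum_{i=1}^{10}\left|\hat{y}_{k,i} - y_i\right|$$
where $\hat{y}_{k,i}$ represents the prediction by model $k$ for day $i$ and $y_i$ is the actual stock price for day $i$. A lower MAE indicates better model performance.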
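Next, the performance scores ($S_k$) for the models are computed based on their current weights:
$$S_k = \frac{\mathrm{MAE}_k}{w_k + \epsilon}$$
The weights are then updated iteratively using the formula
$$w_k^{(t+1)} = \frac{1 / S_k}{\sum_{j=1}^{5} 1 / S_j}$$
Finally, the combined prediction for 11 January 2023 is obtained as
$$\hat{y}_{t+1} = \sum_{k=1}^{5} w_k^{(t+1)}\, \hat{y}_{k,t+1}$$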
This process ensures that the ensemble prediction reflects the strengths of individual models while dynamically adapting to their performance.
Example Output: If the initial weights for are , , , , , and their MAE scores for the last 10 days are , , , and :
This iterative process ensures that the combined prediction leverages the strengths of the best-performing models while minimizing the influence of underperforming ones. It dynamically adapts the ensemble to changing market conditions, making IMCA a robust tool for portfolio management in volatile markets.
We would like to note that vanishing gradients are a well-known challenge in training deep neural networks, as they can hinder learning in earlier layers of the network. However, the DRL models used in this study are specifically designed to mitigate this issue through modern techniques. These include ReLU activation functions, which avoid saturation and preserve gradient magnitudes, and layer normalization, which stabilizes training by maintaining consistent gradient flow across layers. Additionally, architectures like SAC and TD3 incorporate residual connections that enable gradients to bypass intermediate layers, further preventing the vanishing gradient problem. As a result, the proposed model does not suffer from vanishing gradients, eliminating the need for additional algorithms to address this issue.
3.2.3. Performance Evaluation Metrics
The effectiveness of the IMCA framework in portfolio management is assessed using a range of performance evaluation metrics that provide insights into profitability, risk, and overall portfolio performance.
Cumulative Return (CR) measures the total investment growth over the evaluation period, serving as a comprehensive indicator of overall profitability [39]. This metric captures the net effect of all gains and losses on the portfolio during the specified timeframe.
Annual Return (AR) reflects the average yearly growth of the portfolio, enabling meaningful comparisons across different time periods and investment strategies [40]. By annualizing returns, this metric standardizes performance evaluation for strategies operating over varying horizons.
Annualized Volatility (AV) quantifies the variability of portfolio returns on an annual basis, providing a measure of the risk level associated with the investment strategy [41]. A higher volatility indicates greater uncertainty in returns, while lower volatility suggests more stable performance.
The Sharpe Ratio (SR) evaluates risk-adjusted returns by measuring the excess return achieved per unit of risk taken [42]. This metric is instrumental in determining whether a portfolio’s performance justifies the level of risk incurred, offering a comparative perspective across different strategies.
Maximum Drawdown (MD) captures the largest decline in portfolio value from a peak to a trough during the evaluation period, offering insights into potential worst-case losses [43]. This metric is particularly valuable for understanding the resilience of the portfolio under adverse market conditions.
These metrics collectively provide a comprehensive framework for evaluating IMCA’s performance, balancing profitability and risk considerations to determine its efficacy in managing dynamic financial portfolios.
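For reference, the five metrics can be computed from a series of periodic returns as in the sketch below. The conventions used here (252 trading days per year, geometric annualization, sample standard deviation, and a zero risk-free rate by default) are common choices assumed for the example, not settings reported in this study. The function name is illustrative.

```python
import math

def performance_metrics(returns, periods_per_year=252, risk_free_rate=0.0):
    """Cumulative return, annual return, annualized volatility, Sharpe Ratio, maximum drawdown."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two returns")
    wealth, peak, max_drawdown = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        # Drawdown is the fractional decline from the running peak (non-positive).
        max_drawdown = min(max_drawdown, wealth / peak - 1.0)
    cumulative = wealth - 1.0
    annual = wealth ** (periods_per_year / n) - 1.0
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    volatility = math.sqrt(variance) * math.sqrt(periods_per_year)
    sharpe = (annual - risk_free_rate) / volatility if volatility > 0 else float("nan")
    return {
        "cumulative_return": cumulative,
        "annual_return": annual,
        "annual_volatility": volatility,
        "sharpe_ratio": sharpe,
        "max_drawdown": max_drawdown,
    }

# Three hypothetical periods treated as one "year" to keep the arithmetic transparent.
m = performance_metrics([0.10, -0.50, 0.20], periods_per_year=3)
```

In this toy series the portfolio peaks at 1.10, falls to 0.55, and ends at 0.66, so the maximum drawdown is 50% even though the final loss is 34%.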
4. Estimation Results
This section presents the performance evaluation of various portfolio allocation strategies, including DRL-based models and traditional approaches. Figure 2 and Figure 3 provide visual comparisons of cumulative returns, while Table 2 summarizes the performance metrics for all models. The results illustrate the adaptability and robustness of DRL algorithms and the superior performance of IMCA in managing portfolio allocations under volatile market conditions.
4.1. Cumulative Return Trends
The cumulative returns of portfolio allocation models provide valuable insights into their performance over time. This section highlights the comparative analysis of both Reinforcement Learning (RL)-based models and traditional strategies, focusing on their behavior during volatile periods such as the COVID-19 pandemic and their overall recovery patterns.
Figure 2 showcases the cumulative returns of five DRL models, A2C, PPO, DDPG, SAC, and TD3, over the testing period from January 2018 to December 2023. The impact of the COVID-19 pandemic in early 2020 is evident, as all models experienced significant drawdowns during the initial market shock. However, notable differences in recovery patterns emerged:
A2C and SAC: Both models exhibited strong resilience and recovery post-2020, demonstrating their ability to adapt to volatile market conditions. Their cumulative returns surpass those of other DRL models by the end of the testing period, indicating effective portfolio rebalancing and risk management;
PPO: This model showed the weakest performance among the DRL algorithms, with relatively low cumulative returns and slower recovery rates. This outcome highlights PPO’s potential sensitivity to market volatility and its limitations in balancing exploration and exploitation;
DDPG and TD3: These models achieved moderate performance, with consistent but less aggressive recoveries compared to A2C and SAC. Their stability suggests that they are well-suited for environments with less pronounced market fluctuations.
The performance of traditional strategies and IMCA is presented in Figure 3. IMCA demonstrates significantly higher cumulative returns compared to both traditional strategies. This result underscores its robustness and superior growth potential. The dynamic weighting mechanism employed by IMCA effectively leverages the strengths of multiple models, enabling it to outperform during both market downturns and recovery phases. This adaptability makes it particularly suitable for navigating volatile market environments.
For traditional methods, the Min-Variance strategy stands out for achieving the lowest drawdowns, making it particularly appealing to risk-averse investors. However, its limited cumulative returns highlight its inability to fully capitalize on upward market trends, particularly during recovery periods such as those following the COVID-19 pandemic. This trade-off between minimizing risk and maximizing growth exemplifies the inherent limitations of traditional strategies. Similarly, CAPM-based Mean-Variance Portfolio Optimization, which integrates the Capital Asset Pricing Model (CAPM) and Mean-Variance Optimization (MVO), seeks to balance risk and return but remains constrained by its static approach.
The CAPM portfolio allocation strategy, represented by the purple line in the chart, demonstrates significant volatility over the observed period. It performed especially poorly during periods of heightened market turbulence, such as the sharp decline in April 2020. Among the strategies compared, CAPM experienced larger drawdowns and showed weaker recovery, exposing its limitations in adapting to dynamic market conditions. Its reliance on a static, linear risk–return relationship further emphasizes the need for more adaptive and responsive portfolio strategies to effectively navigate rapidly changing market environments.
The SET50 Baseline, representing the average performance of the SET50 stock index, also illustrates the significant shortcomings of static allocation strategies. It suffered the most substantial losses during periods of market volatility, such as the COVID-19 crisis, highlighting its inability to adjust to rapid market changes. The baseline’s underperformance during turbulent times emphasizes the critical need for dynamic portfolio management strategies like the IMCA, which is better equipped to deliver consistent returns and resilience in fluctuating markets.
4.2. Overall Performance Metrics
Table 2 summarizes the performance metrics of the models. A2C, a DRL-based model, achieved the highest annual and cumulative returns, demonstrating its potential for long-term growth. IMCA also delivered robust performance, surpassing traditional strategies such as the Min-Variance approach and the SET50 Baseline, further validating its dynamic and adaptive portfolio allocation methodology.
In terms of annual and cumulative returns, A2C recorded the highest values among the Deep Reinforcement Learning models, showcasing its ability to generate superior long-term growth. IMCA demonstrated robust performance, outperforming all traditional strategies, including Min-Variance, CAPM, and the SET50 Baseline, which confirms its effectiveness in achieving consistent and superior returns over time.
For annual volatility, Min-Variance achieved the lowest level, reflecting its conservative approach and appeal to risk-averse investors. However, the DRL models and IMCA displayed moderate and manageable volatility levels, indicating their ability to balance risk and return effectively while remaining competitive.
When evaluating risk-adjusted performance through the Sharpe Ratio, A2C emerged as the top performer, achieving the highest value among all models. This metric highlights A2C’s efficiency in delivering returns relative to the risk taken. IMCA’s Sharpe Ratio further emphasizes its balanced approach, combining profitability with controlled risk exposure.
In terms of maximum drawdown, Min-Variance exhibited the smallest drawdowns, reinforcing its suitability for investors prioritizing capital preservation. IMCA maintained competitive drawdown levels, outperforming the SET50 Baseline and demonstrating resilience during market downturns. This resilience underscores IMCA’s robustness in adapting to adverse market conditions, further enhancing its appeal to portfolio managers seeking both growth and stability.
Overall, Table 2 illustrates the versatility and adaptability of IMCA, which strikes a balance between profitability and risk management. It also highlights the comparative advantages of DRL models over traditional approaches in dynamic and volatile market environments.
The superior performance of A2C and IMCA can be explained by their ability to dynamically adapt to changing market conditions. For instance, during the COVID-19 pandemic, when the market experienced extreme volatility, A2C utilized its advantage function to stabilize decision-making. This ensured consistent adjustments to portfolio allocations, even during periods of significant market uncertainty. Similarly, IMCA’s ability to adjust model contributions in real time allowed it to reduce the influence of poorly performing models and increase the weight of better-performing ones. This flexibility enabled the portfolio to recover faster and perform more effectively during the market rebound.
In contrast, traditional models like Min-Variance and CAPM rely on fixed weighting strategies. These models struggle to adjust to sudden market changes, as shown by their lower cumulative returns across the sharp market declines and recoveries of the COVID-19 crisis.
Furthermore, DRL models like SAC stood out in handling complex data and exploring a wide range of strategies. Instead of relying on a fixed plan, SAC actively tested and refined its strategies, ensuring that it could adapt to rapidly changing market conditions. This approach was particularly important during the uncertainty of the pandemic, as it helped the model avoid being locked into suboptimal solutions. Combined with IMCA’s flexibility to adapt based on real-time performance, this resulted in a powerful and effective system for managing portfolios even in the most challenging market environments.
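The idea of shifting weight away from poorly performing models and toward better-performing ones can be sketched as follows. The exact IMCA update rule is not restated in this section, so the inverse mean-squared-error weighting used here is an illustrative assumption rather than the authors' algorithm, and the function names `reweight` and `combine` are hypothetical.

```python
def reweight(recent_errors, eps=1e-8):
    """Map each model's recent forecast errors to a normalized weight.

    recent_errors: dict of model name -> list of recent errors.
    Inverse-MSE weighting is an assumption made for illustration;
    it is not the update rule defined in the paper.
    """
    inverse = {}
    for name, errors in recent_errors.items():
        mse = sum(e * e for e in errors) / len(errors)
        inverse[name] = 1.0 / (mse + eps)  # eps avoids division by zero
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def combine(weights, allocations):
    """Blend per-model portfolio allocations using the model weights."""
    n_assets = len(next(iter(allocations.values())))
    blended = [0.0] * n_assets
    for name, allocation in allocations.items():
        for i, share in enumerate(allocation):
            blended[i] += weights[name] * share
    return blended


# Hypothetical example: A2C has been more accurate recently than PPO.
weights = reweight({"A2C": [0.01, 0.01], "PPO": [0.02, 0.02]})
portfolio = combine(weights, {"A2C": [1.0, 0.0], "PPO": [0.0, 1.0]})
```

Because the weights are renormalized at every call, recomputing them on a rolling window of recent errors gives the kind of iterative recalibration described above, and the blended allocation still sums to one whenever each model's allocation does.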
4.3. Pre-COVID Outbreak (1 January 2018 to 31 December 2019)
The statistics above indicate that, during the pre-COVID period, the IMCA strategy outperformed both the Baseline and Minimum Variance strategies, providing the highest cumulative return and Sharpe Ratio. This suggests that IMCA was particularly effective during stable market conditions, likely due to its ability to balance growth and risk.
The Baseline strategy, which mirrors a traditional index approach, showed moderate volatility but struggled with lower cumulative returns and a negative Sharpe Ratio. This performance suggests that conventional index-based strategies may be less effective in periods of steady market growth, where more adaptive models can capitalize on incremental gains.
In contrast, the Minimum Variance strategy offered lower risk, as shown by its lower volatility and smaller drawdowns. However, this emphasis on stability came at the expense of returns, resulting in a lower cumulative return than IMCA. Investors with a risk-averse approach may find Minimum Variance appealing, but the IMCA strategy stands out as the preferred option for those seeking growth without excessive risk in stable markets.
Figure 4 below illustrates the cumulative return trends, while Table 3 provides a summary of key performance metrics for each strategy.
4.4. During-COVID Outbreak (1 January 2020 to 31 December 2021)
The statistics for the during-COVID period show that the market was highly volatile, largely due to the global economic disruptions from the COVID-19 pandemic. This period tested each strategy’s ability to handle sharp declines and rapid rebounds, making it an effective assessment of risk management and adaptability.
The IMCA strategy once again demonstrated strong performance, achieving the highest cumulative returns and a positive Sharpe Ratio despite the volatility. This suggests that IMCA was able to adjust to the rapid market fluctuations more effectively than the other strategies, making it a resilient choice during unpredictable times.
In comparison, the Baseline strategy, which mirrors a traditional market index, experienced high volatility and significant drawdowns, resulting in lower cumulative returns. This outcome highlights the limitations of conventional index-based approaches in times of crisis, as these strategies lack the flexibility to respond quickly to market downturns.
The Minimum Variance strategy, while focused on reducing risk, showed lower cumulative returns as well. Its priority on stability helped it to avoid the worst losses, as indicated by a smaller drawdown than the Baseline, but it still lagged behind IMCA in terms of overall growth. This suggests that, while suitable for risk-averse investors, Minimum Variance may sacrifice growth potential during highly volatile periods.
Figure 5 below illustrates the cumulative return trends, while Table 4 provides a summary of key performance metrics for each strategy during the COVID-19 outbreak period.
This performance contrasts with the following period of market recovery, where different growth dynamics come into play.
4.5. Post-COVID Outbreak (1 January 2022 to 31 December 2023)
The statistics for the post-COVID period illustrate how each strategy adapted to the market’s recovery phase. During this time, economic conditions began to stabilize, and markets rebounded, offering growth opportunities. This period provides insights into each strategy’s ability to capitalize on recovery trends while managing residual volatility.
The IMCA strategy continued to outperform the other models, achieving the highest cumulative returns and a positive Sharpe Ratio. IMCA’s consistent performance highlights its adaptability and growth potential, making it a strong choice for investors seeking to maximize returns in a recovering market.
The Baseline strategy, following the traditional market index, showed some recovery but remained limited by higher drawdowns and moderate cumulative returns. This performance suggests that, while index-based strategies can participate in growth during favorable market conditions, they may still be impacted by lingering volatility, reducing their overall appeal in a recovery phase.
The Minimum Variance strategy demonstrated stability with lower volatility and drawdowns compared to the Baseline. However, its conservative approach led to relatively modest cumulative returns, reflecting a trade-off between stability and growth potential. This makes the Minimum Variance strategy suitable for investors prioritizing risk reduction over aggressive gains in a post-crisis environment.
Figure 6 below illustrates the cumulative return trends, while Table 5 provides a summary of key performance metrics for each strategy during the post-COVID period.
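The sub-period comparison in Sections 4.3 to 4.5 amounts to slicing one return series by date and evaluating each slice separately. A minimal sketch is given below, using the period boundaries stated in the section headings. The input format (a list of `(date, daily_return)` pairs) and the function names are assumptions introduced for illustration.

```python
from datetime import date

# Period boundaries as given in the headings of Sections 4.3 to 4.5.
PERIODS = {
    "pre-COVID": (date(2018, 1, 1), date(2019, 12, 31)),
    "during-COVID": (date(2020, 1, 1), date(2021, 12, 31)),
    "post-COVID": (date(2022, 1, 1), date(2023, 12, 31)),
}


def split_by_period(dated_returns, periods=PERIODS):
    """Group (date, daily_return) pairs into the named sub-periods.

    Observations outside every period are dropped.
    """
    buckets = {name: [] for name in periods}
    for day, daily_return in dated_returns:
        for name, (start, end) in periods.items():
            if start <= day <= end:
                buckets[name].append(daily_return)
                break
    return buckets


def cumulative_return(daily_returns):
    """Total compounded return over one sub-period."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


# Hypothetical data, for illustration only.
sample = [
    (date(2019, 6, 3), 0.01),
    (date(2020, 3, 12), -0.10),
    (date(2020, 3, 13), 0.05),
    (date(2023, 1, 4), 0.02),
]
by_period = {name: cumulative_return(r)
             for name, r in split_by_period(sample).items()}
```

Evaluating each strategy on the same three slices keeps the comparison across Tables 3 to 5 on a like-for-like basis.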
4.6. Robustness Check
To verify the performance of the proposed IMCA model, we conduct two robustness analyses. First, we evaluate the performance of IMCA while accounting for transaction costs. Second, we examine the impact of varying the timestep to assess the robustness and generalization of the Deep Reinforcement Learning model, which can also indirectly help to detect overfitting.
4.6.1. Measuring the Performance of Strategies Under Transaction Costs
To measure the performance of the IMCA model under transaction costs, we account for daily trading strategies that may involve restructuring positions multiple times per day, with a maximum of 50 trades daily. The profit and loss for these strategies, when applying a transaction cost of 0.02% per trade, are presented in Table 6.
Compared with the results obtained without transaction costs (Table 2), Table 6 shows that the IMCA model demonstrates notable robustness when transaction costs are applied. While there is a reduction in its annual returns (from 2.32% to 2.10%) and cumulative returns (from 14.20% to 13.00%), IMCA remains one of the best-performing models. Its Sharpe ratio declines slightly (from 0.220 to 0.185), highlighting the inevitable impact of transaction costs on risk-adjusted returns while maintaining its strong performance relative to other strategies.
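One common way to apply a proportional transaction cost is to charge it on the turnover generated by each rebalancing. The sketch below uses the 0.02% rate stated above. How the study itself maps trades to costs is not detailed in this section, so the turnover-based charge, the neglect of weight drift between rebalancings, and the function names are all assumptions.

```python
COST_RATE = 0.0002  # 0.02% per trade, as stated in the text


def turnover(old_weights, new_weights):
    """Sum of absolute weight changes across assets for one rebalancing."""
    return sum(abs(new - old) for old, new in zip(old_weights, new_weights))


def net_return(gross_return, old_weights, new_weights, cost_rate=COST_RATE):
    """Gross return minus a cost proportional to turnover.

    Assumption: the cost is charged on the traded fraction of the
    portfolio; this is an illustrative convention, not the paper's.
    """
    return gross_return - cost_rate * turnover(old_weights, new_weights)


# Hypothetical rebalancing: move 20% of the portfolio between two assets.
example = net_return(0.01, [0.5, 0.5], [0.7, 0.3])
```

Under this convention a strategy that rebalances heavily every day loses more to costs than one with stable weights, which is consistent with the moderate reductions reported for IMCA in Table 6.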
4.6.2. Evaluating Learning Progression
The proposed Iterative Model Combining Algorithm trading strategy consists of two main components: Deep Reinforcement Learning optimization and the calculation of optimal weights for each algorithm, as described in Section 3. The IMCA dynamically adjusts to changing market conditions, extracting informative features from the environment to optimize portfolio performance. However, the complexity of the IMCA introduces potential challenges, such as sampling noise, which could lead to overfitting. To evaluate its learning progression, we use the reward vs. timesteps graph, a fundamental tool in DRL that illustrates how well the model adapts and learns over time. In this experiment, the timesteps range from 10,000 to 200,000, as shown in Figure 7.
Figure 7 shows the IMCA model’s learning progression over timesteps from 10,000 to 200,000. Initially, the average reward increases sharply from −19.03 to 34.31 (10,000 to 50,000 timesteps), indicating effective learning. The reward continues to rise, peaking at 55.98 by 100,000 timesteps, reflecting robust optimization. Beyond this, fluctuations occur, with a dip to 45 at 150,000 timesteps before recovering to 51 at 200,000, likely due to sampling noise or market sensitivity. Overall, the IMCA demonstrates strong learning and adaptability, though slight refinements could further stabilize performance in later stages.
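The checkpoints quoted above are enough to quantify the late-stage fluctuation. The short sketch below locates the peak reward and measures how far the reward falls after it. The reward values are taken from the text (the last two are the rounded figures given there), while the function name and the choice of summary statistics are assumptions made for illustration.

```python
# (timesteps, average reward) checkpoints quoted in the text for Figure 7.
CHECKPOINTS = [
    (10_000, -19.03),
    (50_000, 34.31),
    (100_000, 55.98),
    (150_000, 45.0),
    (200_000, 51.0),
]


def summarize_learning_curve(checkpoints):
    """Return the peak checkpoint and the reward lost after the peak."""
    peak_step, peak_reward = max(checkpoints, key=lambda c: c[1])
    after_peak = [reward for step, reward in checkpoints if step > peak_step]
    lowest_after = min(after_peak) if after_peak else peak_reward
    final_reward = checkpoints[-1][1]
    return {
        "peak_step": peak_step,
        "peak_reward": peak_reward,
        "largest_drop_after_peak": peak_reward - lowest_after,
        "final_gap_to_peak": peak_reward - final_reward,
    }


summary = summarize_learning_curve(CHECKPOINTS)
```

A drop after the peak that is small relative to the overall gain from the first checkpoint is consistent with the reading that later fluctuations reflect noise rather than a collapse in learning.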
4.7. Discussion
The results reveal several key insights that underline the effectiveness of the proposed models and methodologies.
First, the superiority of IMCA is evident from its consistent outperformance of traditional portfolio allocation methods. IMCA demonstrates adaptability and effectiveness, particularly in emerging markets like the SET50 Index, aligning with prior studies that highlight the importance of dynamic asset allocation in improving portfolio performance [30, 39]. Its dynamic weighting mechanism, which leverages the strengths of multiple models, ensures resilience in the face of market fluctuations. This adaptability is consistent with findings that emphasize the importance of flexibility in managing volatile markets [44, 45]. IMCA’s ability to balance risk and return makes it a versatile and reliable tool for portfolio management.
Second, insights from the performance of DRL algorithms highlight the strengths and limitations of individual models. A2C and SAC stand out as the most effective DRL models in managing risk and capitalizing on market opportunities. These results are supported by previous research that demonstrates the ability of DRL algorithms to adapt to complex environments and optimize long-term objectives [46, 47]. In contrast, PPO’s underperformance emphasizes the critical importance of selecting DRL algorithms tailored to the specific complexities and challenges of financial markets, where adaptability and precision are key [4, 19].
Finally, the results emphasize the trade-offs between risk and return across the evaluated strategies. While Min-Variance offers the lowest risk among all methods, its limited growth potential aligns with the findings of traditional portfolio theory, which identifies the trade-off between minimizing risk and achieving higher returns [30, 48]. These results highlight the necessity of adopting more dynamic approaches, such as IMCA, to achieve long-term investment objectives. Studies in the context of adaptive asset allocation strategies also validate this observation, suggesting that models incorporating real-time adjustments deliver superior outcomes in changing market environments [21, 22].
Overall, the findings demonstrate that integrating DRL techniques with IMCA creates a robust and efficient portfolio management framework. This approach not only navigates volatile market conditions but also delivers superior risk-adjusted returns, consistent with prior studies that highlight the advantages of combining machine learning with portfolio optimization [49]. These results make IMCA an ideal solution for investors seeking both stability and profitability in dynamic financial environments.
5. Conclusions
The findings highlight the superior performance of the Iterative Model Combining Algorithm, with an Annual Return of 2.32%, Cumulative Returns of 14.20%, and a Sharpe Ratio of 0.220, outperforming traditional models like the Minimum Variance strategy, which recorded a negative Annual Return of −0.77%, Cumulative Returns of −4.35%, and a Sharpe Ratio of 0.018. The high Sharpe Ratio and competitive Max Drawdown of IMCA and A2C underscore their ability to deliver strong returns while maintaining effective risk management. For IMCA in particular, consistent profitability comes with balanced risk exposure, as evidenced by its manageable annual volatility of 17.56% and competitive maximum drawdown of −44.78%.
The Advantage Actor–Critic model recorded the highest Annual Return (2.78%) and Cumulative Return (17.16%) among the DRL-based models, showcasing its effectiveness in long-term growth. Its Sharpe Ratio of 0.246, the highest across all models, reflects its efficiency in delivering risk-adjusted returns. These metrics collectively highlight the strength of DRL-based approaches in dynamically adapting to volatile market conditions, particularly during periods of extreme uncertainty, such as the COVID-19 pandemic.
The comparative analysis reveals that, while traditional strategies like the Minimum Variance approach prioritize risk minimization, they often fail to capitalize on upward market trends, leading to suboptimal returns. In contrast, IMCA and DRL models demonstrate a balanced approach, combining adaptability and profitability to deliver superior performance under stable and volatile market conditions.
This study demonstrates the potential of combining multiple DRL algorithms to optimize portfolio management, specifically for SET50 stocks. By utilizing an IMCA, we dynamically adjusted model weights to minimize forecasting errors, enabling our combined models to adapt effectively under varying market conditions. The findings indicate that this hybrid approach significantly outperforms traditional strategies, such as the Min-Variance strategy or the SET50 Baseline, in both return generation and risk management.
Our results highlight that DRL models, when implemented with diverse algorithms and optimized using techniques like IMCA, provide a robust and adaptive framework for trading in volatile financial markets. Notably, algorithms such as Advantage Actor–Critic (A2C) and Soft Actor–Critic (SAC) demonstrated resilience and delivered higher Cumulative Returns, particularly during periods of extreme market turbulence, such as the COVID-19 pandemic. These models excelled in balancing risk and return, underscoring the adaptability and effectiveness of DRL in managing dynamic and unpredictable market environments.
The superior performance of A2C and IMCA can be explained by their ability to dynamically adapt to changing market conditions. For instance, during the COVID-19 pandemic, when the market experienced extreme volatility, A2C utilized its advantage function to stabilize decision-making. This ensured consistent adjustments to portfolio allocations, even during periods of significant market uncertainty. Similarly, IMCA’s ability to adjust model contributions in real time allowed it to reduce the influence of underperforming models, such as those impacted by short-term market anomalies, while amplifying the contribution of consistently high-performing ones. This flexibility enabled the portfolio to recover faster and perform more effectively during the market rebound.
Future research could explore the integration of additional DRL models or the application of this framework to other market environments, such as commodities or developed stock indices, to validate its versatility. Further investigations into the incorporation of transaction costs, liquidity constraints, and other real-world factors would enhance the practical applicability of the approach. Additionally, optimizing hyperparameters and exploring advanced ensemble techniques, such as those incorporating sentiment analysis or macroeconomic indicators, could further improve forecasting accuracy and profitability [50, 51]. Conducting comparative experiments with alternative ensemble and machine learning methods, such as Double Deep Q-Learning and Trust Region Policy Optimization, would help to validate the superiority of the IMCA framework. Moreover, testing the framework on more extensive portfolios would provide insights into its scalability and effectiveness across different asset classes. Overall, this study establishes a strong foundation for leveraging DRL and IMCA in the development of advanced portfolio management systems.