Article

Composition of Activation Functions and the Reduction to Finite Domain

by
George A. Anastassiou
Department of Mathematical Sciences, University of Memphis, Memphis, TN 38152, USA
Mathematics 2025, 13(19), 3177; https://doi.org/10.3390/math13193177
Submission received: 3 September 2025 / Revised: 23 September 2025 / Accepted: 30 September 2025 / Published: 3 October 2025
(This article belongs to the Special Issue Special Functions with Applications)

Abstract

This work determines the rate of pointwise and uniform convergence to the unit operator of the “normalized cusp neural network operators”. The cusp is a compactly supported activation function obtained as the composition of two general activation functions whose domain is the whole real line. These convergences are expressed via the modulus of continuity of the function involved, or of its derivative, in the form of Jackson-type inequalities. The composition of activation functions aims at more flexible and powerful neural networks, introducing for the first time the reduction of infinite domains to a single domain of compact support.

1. Introduction

From AI and computer science, we have the following: in essence, composing activation functions in neural networks offers the advantage of potentially tailoring the network’s ability to learn and model complex, non-linear relationships in data. Here is a breakdown of the potential benefits:
  • Enhanced Capacity for Complex Modeling:
    • Diversification of Non-linearity: Different activation functions have different characteristics. For example, ReLU introduces sparsity, while Sigmoid squashes values into a range. By composing them, the network potentially can learn a wider variety of non-linear transformations and capture more intricate patterns in the data.
  • Improved Training Dynamics:
    • Mitigating Gradient Problems: Activation functions influence gradient flow during training. Using different activation functions can potentially help address issues like vanishing or exploding gradients, which hinder learning in deep networks.
    • Faster Convergence: Certain activation functions, like ReLU, can accelerate the convergence of the training process compared to others like Sigmoid or Tanh. Combining different functions can potentially lead to faster training and competitive performance.
  • Enhanced Generalization and Robustness:
    • Better Generalization: By learning richer representations of the data through diverse activation functions, the network’s ability to generalize well to unseen data improves, reducing the risk of overfitting.
    • Increased Robustness: Networks with carefully chosen activation functions can handle variations in input data more effectively, adapting to noise, missing data, or unexpected perturbations.
  • Adaptation to Input Characteristics:
    • Handling Diverse Data: Different activation functions can be suited to different data characteristics. For instance, tanh can be useful when dealing with data containing both positive and negative values.
  • Potential for Architectural Interpretability:
    • Insight into Learning: By using distinct activation functions, different parts of the network might become responsible for capturing specific features, which can potentially offer insights into how the model learns.
In summary, composing activation functions potentially allows for a more flexible and powerful neural network capable of:
  • Learning more complex patterns.
  • Faster and more stable training.
  • Better generalization to new data.
  • Greater adaptability to diverse data.
Attention: While composing activation functions can offer benefits, it’s important to choose them judiciously and with consideration for the specific problem at hand, as some combinations might not be beneficial or could even lead to unwanted behaviors like exploding gradients. Empirical testing and validation are crucial when exploring different activation function compositions.
The author was greatly inspired and motivated by [1]; he pioneered quantitative neural network approximation (see [2]) and has since published numerous papers and books, e.g., [3].
In this article, we continue this trend.
In mathematical neural network approximation, AMS MathSciNet lists no articles related to the composition of activation functions, so this is the first work of its kind.
By using the composition of activation functions, we pursue the goals described in the first, extensive part of this introduction; most notably, this composition leads to an activation function of compact support, even though the initial activation functions have an infinite domain, the whole real line.
Now the resulting activation function is an open cusp of compact support $[-1,1]$. The activation functions involved are very general, and the constructed neural network operators resemble the squashing operators in [2,3], and so do the produced quantitative results.
As a result, our produced convergence inequalities look much simpler and nicer.
Of great inspiration are the articles [4,5,6]. References [7,8,9] are foundational. Finally, references [10,11,12,13] represent recent important works.

2. Basics

Let $i=1,2$, and let $h_i:\mathbb{R}\to(-1,1)$ be general sigmoid activation functions that are strictly increasing, with $h_i(0)=0$, $h_i(-x)=-h_i(x)$ for all $x\in\mathbb{R}$, $h_i(+\infty)=1$ and $h_i(-\infty)=-1$. Also, $h_i$ is strictly convex over $(-\infty,0]$ and strictly concave over $[0,+\infty)$, with $h_i\in C^{2}(\mathbb{R})$.
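For instance, the functions
$h(x)=\tanh x,\qquad h(x)=\frac{2}{\pi}\arctan\!\left(\frac{\pi}{2}\,x\right),\qquad h(x)=\frac{x}{\sqrt{1+x^{2}}}$
are concrete examples of admissible choices: each is odd, strictly increasing, tends to $\pm 1$ as $x\to\pm\infty$, is strictly convex on $(-\infty,0]$ and strictly concave on $[0,+\infty)$, and belongs to $C^{2}(\mathbb{R})$ (in fact to $C^{\infty}(\mathbb{R})$).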
Clearly, $h_1\circ h_2=h_1|_{(-1,1)}\circ h_2$ is strictly increasing, $(h_1\circ h_2)(0)=0$, and
$(h_1\circ h_2)(-x)=h_1(h_2(-x))=h_1(-h_2(x))=-h_1(h_2(x))=-(h_1\circ h_2)(x),$
that is,
$(h_1\circ h_2)(-x)=-(h_1\circ h_2)(x),\quad\forall\,x\in\mathbb{R}.$
Furthermore,
$(h_1\circ h_2)(+\infty)=h_1(h_2(+\infty))=h_1(1),$
$(h_1\circ h_2)(-\infty)=h_1(h_2(-\infty))=h_1(-1).$
Next we act over $(-\infty,0]$: let $\lambda,\mu\ge 0$ with $\lambda+\mu=1$. Then, by the convexity of $h_2$ there, we have
$h_2(\lambda x+\mu y)\le\lambda h_2(x)+\mu h_2(y),\quad\forall\,x,y\in(-\infty,0];$
and, since $h_1$ is increasing and convex on $(-\infty,0]$ and $h_2(x),h_2(y)\le 0$,
$h_1(h_2(\lambda x+\mu y))\le h_1(\lambda h_2(x)+\mu h_2(y))\le\lambda h_1(h_2(x))+\mu h_1(h_2(y)),$
i.e.,
$(h_1\circ h_2)(\lambda x+\mu y)\le\lambda(h_1\circ h_2)(x)+\mu(h_1\circ h_2)(y),$
$\forall\,x,y\in(-\infty,0].$
So $h_1\circ h_2$ is convex over $(-\infty,0]$.
Similarly, over $[0,+\infty)$ we get: let $\lambda,\mu\ge 0$ with $\lambda+\mu=1$. Then, by the concavity of $h_2$ there, we have
$h_2(\lambda x+\mu y)\ge\lambda h_2(x)+\mu h_2(y),\quad\forall\,x,y\in[0,+\infty);$
and, since $h_1$ is increasing and concave on $[0,+\infty)$ and $h_2(x),h_2(y)\ge 0$,
$h_1(h_2(\lambda x+\mu y))\ge h_1(\lambda h_2(x)+\mu h_2(y))\ge\lambda h_1(h_2(x))+\mu h_1(h_2(y)).$
Therefore $h_1\circ h_2$ is concave over $[0,+\infty)$.
Also, it holds that
$(h_1\circ h_2)''(x)=h_1''(h_2(x))\,(h_2'(x))^{2}+h_1'(h_2(x))\,h_2''(x)\in C(\mathbb{R}),\quad\forall\,x\in\mathbb{R}.$
So $h_1\circ h_2$ is a sigmoid activation function.
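To make this concrete, the following minimal numerical sketch composes two admissible sigmoids, taking, purely for illustration, $h_1=\tanh$ and $h_2(x)=x/\sqrt{1+x^{2}}$ (any admissible pair would do; the helper name composed is ours), and spot-checks the oddness, strict monotonicity and convex/concave behaviour derived above; the sampled values stay inside $(h_1(-1),h_1(1))=(-\tanh 1,\tanh 1)$, in line with the limits computed above.

```python
import numpy as np

# Illustrative admissible sigmoids: odd, strictly increasing, limits +-1,
# convex on (-inf, 0], concave on [0, +inf), smooth.
h1 = np.tanh
h2 = lambda x: x / np.sqrt(1.0 + x**2)

def composed(x):
    """The composition (h1 o h2)(x)."""
    return h1(h2(x))

x = np.linspace(-8.0, 8.0, 2001)
y = composed(x)

# Oddness: (h1 o h2)(-x) = -(h1 o h2)(x).
assert np.allclose(composed(-x), -y, atol=1e-12)

# Strict monotonicity on the sampled grid.
assert np.all(np.diff(y) > 0)

# Second differences: >= 0 (convexity) left of 0, <= 0 (concavity) right of 0.
d2 = np.diff(y, 2)
mid = x[1:-1]
assert np.all(d2[mid < -1e-6] >= 0) and np.all(d2[mid > 1e-6] <= 0)

print("checks passed; sampled range:", y.min(), y.max())  # stays within (-tanh 1, tanh 1)
```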
Next we consider the function
$\psi_{1,2}(x):=\frac{1}{4}\left[(h_1\circ h_2)(x+1)-(h_1\circ h_2)(x-1)\right]>0,\quad\forall\,x\in\mathbb{R}.$
We observe that
$\psi_{1,2}(-x)=\frac{1}{4}\left[(h_1\circ h_2)(-x+1)-(h_1\circ h_2)(-x-1)\right]=$
$\frac{1}{4}\left[(h_1\circ h_2)(-(x-1))-(h_1\circ h_2)(-(x+1))\right]=$
$\frac{1}{4}\left[-(h_1\circ h_2)(x-1)+(h_1\circ h_2)(x+1)\right]=$
$\frac{1}{4}\left[(h_1\circ h_2)(x+1)-(h_1\circ h_2)(x-1)\right]=\psi_{1,2}(x),$
that is
$\psi_{1,2}(-x)=\psi_{1,2}(x),\quad\forall\,x\in\mathbb{R}.$
So $\psi_{1,2}$ can serve as a density function in general.
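Continuing the same illustrative choice ($h_1=\tanh$, $h_2(x)=x/\sqrt{1+x^{2}}$; the helper name psi_12 is ours), one can also verify numerically that $\psi_{1,2}$ is positive and even:

```python
import numpy as np

h1 = np.tanh
h2 = lambda x: x / np.sqrt(1.0 + x**2)
H12 = lambda x: h1(h2(x))  # the composed sigmoid h1 o h2

def psi_12(x):
    """psi_{1,2}(x) = (1/4) * [ (h1 o h2)(x + 1) - (h1 o h2)(x - 1) ]."""
    return 0.25 * (H12(x + 1.0) - H12(x - 1.0))

x = np.linspace(-10.0, 10.0, 4001)
vals = psi_12(x)

assert np.all(vals > 0)                            # positivity: h1 o h2 is strictly increasing
assert np.allclose(psi_12(-x), vals, atol=1e-12)   # evenness, as shown above
print("psi_{1,2}: min =", vals.min(), ", value at 0 =", psi_12(0.0))
```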
So we have $h_2:\mathbb{R}\to(-1,1)$, $h_1|_{(-1,1)}:(-1,1)\to(-1,1)$, and the strictly increasing function $H:=h_1|_{(-1,1)}\circ h_2:\mathbb{R}\to(-1,1)$, with the graph of $H$ being an arc of finite length with open ends, such that $H(0)=0$, starting at $(-1,-1)$ and terminating at $(1,1)$. In particular, $H$ is negative and convex over $(-1,0]$, and positive and concave over $[0,1)$.
So it has compact support $[-1,1]$, and it is like a squashing function; see [3], Chapter 1, p. 8.
From now on we will work with $H$, whose graph is an open cusp joining the points $(-1,-1)$, $(0,0)$ and $(1,1)$, again with compact support $[-1,1]$. The points $(-1,-1)$ and $(1,1)$ do not belong to the graph of $H$; however, $(0,0)$ does.

3. Background

Here we consider functions $f:\mathbb{R}\to\mathbb{R}$ that are either continuous and bounded, or uniformly continuous.
The first modulus of continuity is given by
$\omega_1(f,\delta):=\sup_{x,y\in\mathbb{R}:\,|x-y|\le\delta}|f(x)-f(y)|,\quad\delta>0.$
Here we have that $\omega_1(f,\delta)<+\infty$ for every $\delta>0$.
In this article, we study the pointwise and uniform convergence, with rates, over the real line, to the unit operator, of the “normalized cusp neural network operators”
$A_n(f)(x):=\frac{\sum_{k=-n^{2}}^{n^{2}}f\left(\frac{k}{n}\right)H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{\sum_{k=-n^{2}}^{n^{2}}H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)},\quad$ (1)
where $0<\alpha<1$, $x\in\mathbb{R}$ and $n\in\mathbb{N}$.
Notice that $A_n$ is a positive linear operator with $A_n(1)=1$.
The terms in the ratio of sums (1) can be non-negative and make sense iff $0<n^{1-\alpha}\left|x-\frac{k}{n}\right|\le 1$, i.e., $0<\left|x-\frac{k}{n}\right|\le\frac{1}{n^{1-\alpha}}$, iff
$nx-n^{\alpha}\le k\le nx+n^{\alpha},\quad x\ne\frac{k}{n}.$
In order to have the desired order of numbers
$-n^{2}\le nx-n^{\alpha}\le nx+n^{\alpha}\le n^{2},\quad x\ne\frac{k}{n},$
it is sufficient to assume that
$n\ge 1+|x|,\quad x\ne\frac{k}{n}.\quad$ (3)
When $x\in[-1,1]$, $x\ne\frac{k}{n}$, it is enough to assume $n\ge 2$, which implies (3).
But the case $x=\frac{k}{n}$ contributes nothing and can be ignored.
Thus, without loss of generality, we can always assume that $x\ne\frac{k}{n}$.
Proposition 1
([2]). Let $a,b\in\mathbb{R}$, $a\le b$. Let $\mathrm{card}(k)$ ($\ge 0$) be the maximum number of integers contained in $[a,b]$. Then
$\max\left(0,(b-a)-1\right)\le\mathrm{card}(k)\le(b-a)+1.$
Note 1.
We would like to establish a lower bound on $\mathrm{card}(k)$ over the interval $\left[nx-n^{\alpha},\,nx+n^{\alpha}\right]$. By Proposition 1, we get that
$\mathrm{card}(k)\ge\max\left(2n^{\alpha}-1,\,0\right).$
We obtain $\mathrm{card}(k)\ge 1$ if $2n^{\alpha}-1\ge 1$, iff $n\ge 1$, which is always true.
So, in order to have the desired ordering (see (3)) and $\mathrm{card}(k)\ge 1$ over $\left[nx-n^{\alpha},\,nx+n^{\alpha}\right]$, it is enough to consider
$n\ge\max\left(1+|x|,\,1\right)=1+|x|.$
Also notice that $\mathrm{card}(k)\to+\infty$ as $n\to+\infty$.
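To fix ideas with concrete (purely illustrative) numbers: take $\alpha=\frac{1}{2}$, $n=100$ and $x=0.3$. Then $n\ge 1+|x|$ holds, the relevant interval is $\left[nx-n^{\alpha},\,nx+n^{\alpha}\right]=[30-10,\,30+10]=[20,40]$, which contains $21$ integers, consistent with the lower bound $\mathrm{card}(k)\ge 2n^{\alpha}-1=19$.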
Denote by $\lfloor\cdot\rfloor$ the integral part (floor) of a number and by $\lceil\cdot\rceil$ its ceiling.
Thus, it is clear that
$A_n(f)(x):=\frac{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}f\left(\frac{k}{n}\right)H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)},$
where $0<\alpha<1$ and $n\in\mathbb{N}$ with $n\ge 1+|x|$, $x\in\mathbb{R}$.
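As an illustration, the finite-sum form above can be implemented directly. In the sketch below, $H(t)=\tanh(\tanh t)$, evaluated at the non-negative argument $t=n^{1-\alpha}\left|x-\frac{k}{n}\right|$, is only a convenient stand-in for the cusp activation of Section 2, and the function name A_n is ours.

```python
import math

def H(t):
    """Stand-in activation: the composed sigmoid tanh(tanh(t)), used for t >= 0."""
    return math.tanh(math.tanh(t))

def A_n(f, x, n, alpha=0.5):
    """Normalized operator in its finite-sum form: the sum runs over the
    integers k in [n*x - n**alpha, n*x + n**alpha]."""
    if n < 1 + abs(x):
        raise ValueError("need n >= 1 + |x|")
    lo = math.ceil(n * x - n**alpha)
    hi = math.floor(n * x + n**alpha)
    num = den = 0.0
    for k in range(lo, hi + 1):
        w = H(n**(1 - alpha) * abs(x - k / n))  # weight; vanishes when x == k/n since H(0) = 0
        num += f(k / n) * w
        den += w
    return num / den

# Usage: the pointwise error should be controlled by omega_1(f, 1/n**(1 - alpha)).
if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, abs(A_n(math.sin, 0.3, n) - math.sin(0.3)))
```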

4. Main Results

Next comes our first main result.
Theorem 1.
Let $x\in\mathbb{R}$ and $n\in\mathbb{N}$ with $n\ge 1+|x|$, and let $f:\mathbb{R}\to\mathbb{R}$ be either continuous and bounded, or uniformly continuous. Then
$|A_n(f)(x)-f(x)|\le\omega_1\!\left(f,\frac{1}{n^{1-\alpha}}\right),$
where $\omega_1$ is the first modulus of continuity of $f$. Hence $\lim_{n\to+\infty}A_n(f)(x)=f(x)$ pointwise, given that $f$ is uniformly continuous.
When $n\ge 2$, we obtain
$\|A_n(f)-f\|_{\infty,[-1,1]}\le\omega_1\!\left(f,\frac{1}{n^{1-\alpha}}\right).$
Hence $\lim_{n\to+\infty}A_n(f)=f$ uniformly over $[-1,1]$, given that $f$ is uniformly continuous.
Proof. 
$|A_n(f)(x)-f(x)|=\left|\frac{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}f\left(\frac{k}{n}\right)H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}-f(x)\right|=$
$\left|\frac{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\left(f\left(\frac{k}{n}\right)-f(x)\right)H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}\right|$
$\le\frac{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\left|f\left(\frac{k}{n}\right)-f(x)\right|H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}$
$\le\omega_1\!\left(f,\frac{1}{n^{1-\alpha}}\right)\frac{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}=\omega_1\!\left(f,\frac{1}{n^{1-\alpha}}\right),$
since $\left|\frac{k}{n}-x\right|\le\frac{1}{n^{1-\alpha}}$ for every $k$ in the above sums. □
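For example, in the special case where $f$ is Lipschitz with constant $L>0$, we have $\omega_1(f,\delta)\le L\delta$, and Theorem 1 yields
$|A_n(f)(x)-f(x)|\le\frac{L}{n^{1-\alpha}},$
so for $\alpha=\frac{1}{2}$ the pointwise (and, for $n\ge 2$, uniform over $[-1,1]$) rate is of order $n^{-1/2}$.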
We continue with our second main result.
Theorem 2.
Let $x\in\mathbb{R}$ and $n\in\mathbb{N}$ with $n\ge 1+|x|$. Let $f\in C^{N}(\mathbb{R})$, $N\in\mathbb{N}$, be such that $f^{(N)}$ is either a continuous and bounded function or a uniformly continuous function. Then
$|A_n(f)(x)-f(x)|\le\sum_{j=1}^{N}\frac{|f^{(j)}(x)|}{n^{j(1-\alpha)}\,j!}+\omega_1\!\left(f^{(N)},\frac{1}{n^{1-\alpha}}\right)\frac{1}{N!\,n^{N(1-\alpha)}}.\quad$ (10)
Notice that, as $n\to+\infty$, the right-hand side of (10) tends to $0$; therefore the left-hand side of (10) tends to $0$ as well, i.e., inequality (10) gives, with rates, the pointwise convergence $A_n(f)(x)\to f(x)$ as $n\to+\infty$, $x\in\mathbb{R}$.
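For instance, in the special case $N=1$, inequality (10) reads
$|A_n(f)(x)-f(x)|\le\frac{|f'(x)|}{n^{1-\alpha}}+\omega_1\!\left(f',\frac{1}{n^{1-\alpha}}\right)\frac{1}{n^{1-\alpha}},$
so the dominant term is of order $n^{-(1-\alpha)}$ whenever $f'$ is bounded.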
Proof. 
Using Taylor’s formula, we have
$f\!\left(\frac{k}{n}\right)=\sum_{j=0}^{N}\frac{f^{(j)}(x)}{j!}\left(\frac{k}{n}-x\right)^{j}+\int_{x}^{k/n}\left(f^{(N)}(t)-f^{(N)}(x)\right)\frac{\left(\frac{k}{n}-t\right)^{N-1}}{(N-1)!}\,dt.$
Call
$W(x):=\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}H\!\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right).$
Hence
$\frac{f\left(\frac{k}{n}\right)H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}=\sum_{j=0}^{N}\frac{f^{(j)}(x)}{j!}\left(\frac{k}{n}-x\right)^{j}\frac{H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}+$
$\frac{H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}\int_{x}^{k/n}\left(f^{(N)}(t)-f^{(N)}(x)\right)\frac{\left(\frac{k}{n}-t\right)^{N-1}}{(N-1)!}\,dt.$
Thus
$A_n(f)(x)-f(x)=\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\frac{f\left(\frac{k}{n}\right)H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}-f(x)$
$=\sum_{j=1}^{N}\frac{f^{(j)}(x)}{j!}\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\left(\frac{k}{n}-x\right)^{j}\frac{H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}+F,$
where
$F:=\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\frac{H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}\int_{x}^{k/n}\left(f^{(N)}(t)-f^{(N)}(x)\right)\frac{\left(\frac{k}{n}-t\right)^{N-1}}{(N-1)!}\,dt;$
here the $j=0$ term of the Taylor expansion equals $f(x)$, because the weights $H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)/W(x)$ sum to $1$, and it cancels against $-f(x)$.
So that
$|A_n(f)(x)-f(x)|\le\sum_{j=1}^{N}\frac{|f^{(j)}(x)|}{j!}\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\frac{1}{n^{j(1-\alpha)}}\frac{H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}+|F|,$
since $\left|\frac{k}{n}-x\right|\le\frac{1}{n^{1-\alpha}}$ for all $k$ in the sum.
And hence
$|A_n(f)(x)-f(x)|\le\sum_{j=1}^{N}\frac{|f^{(j)}(x)|}{n^{j(1-\alpha)}\,j!}+|F|.$
Next we estimate
$|F|=\left|\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\frac{H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}\int_{x}^{k/n}\left(f^{(N)}(t)-f^{(N)}(x)\right)\frac{\left(\frac{k}{n}-t\right)^{N-1}}{(N-1)!}\,dt\right|$
$\le\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\frac{H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}\left|\int_{x}^{k/n}\left(f^{(N)}(t)-f^{(N)}(x)\right)\frac{\left(\frac{k}{n}-t\right)^{N-1}}{(N-1)!}\,dt\right|$
$\le\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\frac{H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}\,\rho,$
where
$\rho:=\left|\int_{x}^{k/n}\left|f^{(N)}(t)-f^{(N)}(x)\right|\frac{\left|\frac{k}{n}-t\right|^{N-1}}{(N-1)!}\,dt\right|,$
and, since $\rho\le\psi$ (to be shown below),
$\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\frac{H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}\,\rho\le\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\frac{H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{W(x)}\,\psi=\psi,$
where
$\psi:=\omega_1\!\left(f^{(N)},\frac{1}{n^{1-\alpha}}\right)\frac{1}{N!\,n^{N(1-\alpha)}}.$
Indeed, the bound $\rho\le\psi$ is justified by the following case analysis:
(i) Let $x\le\frac{k}{n}$. Then
$\rho=\int_{x}^{k/n}\left|f^{(N)}(t)-f^{(N)}(x)\right|\frac{\left(\frac{k}{n}-t\right)^{N-1}}{(N-1)!}\,dt$
$\le\int_{x}^{k/n}\omega_1\!\left(f^{(N)},|t-x|\right)\frac{\left(\frac{k}{n}-t\right)^{N-1}}{(N-1)!}\,dt$
$\le\omega_1\!\left(f^{(N)},\left|x-\frac{k}{n}\right|\right)\int_{x}^{k/n}\frac{\left(\frac{k}{n}-t\right)^{N-1}}{(N-1)!}\,dt$
$\le\omega_1\!\left(f^{(N)},\frac{1}{n^{1-\alpha}}\right)\frac{\left(\frac{k}{n}-x\right)^{N}}{N!}\le\omega_1\!\left(f^{(N)},\frac{1}{n^{1-\alpha}}\right)\frac{1}{N!\,n^{N(1-\alpha)}};$
i.e., when $x\le\frac{k}{n}$ we get
$\rho\le\omega_1\!\left(f^{(N)},\frac{1}{n^{1-\alpha}}\right)\frac{1}{N!\,n^{N(1-\alpha)}}.$
(ii) Let $x\ge\frac{k}{n}$. Then
$\rho=\left|\int_{k/n}^{x}\left|f^{(N)}(t)-f^{(N)}(x)\right|\frac{\left(t-\frac{k}{n}\right)^{N-1}}{(N-1)!}\,dt\right|=\int_{k/n}^{x}\left|f^{(N)}(t)-f^{(N)}(x)\right|\frac{\left(t-\frac{k}{n}\right)^{N-1}}{(N-1)!}\,dt$
$\le\int_{k/n}^{x}\omega_1\!\left(f^{(N)},|t-x|\right)\frac{\left(t-\frac{k}{n}\right)^{N-1}}{(N-1)!}\,dt$
$\le\omega_1\!\left(f^{(N)},\left|x-\frac{k}{n}\right|\right)\int_{k/n}^{x}\frac{\left(t-\frac{k}{n}\right)^{N-1}}{(N-1)!}\,dt=$
$\omega_1\!\left(f^{(N)},\left|x-\frac{k}{n}\right|\right)\frac{\left(x-\frac{k}{n}\right)^{N}}{N!}\le\omega_1\!\left(f^{(N)},\frac{1}{n^{1-\alpha}}\right)\frac{1}{N!\,n^{N(1-\alpha)}}=\psi.$
Thus, in both cases we get
$\rho\le\omega_1\!\left(f^{(N)},\frac{1}{n^{1-\alpha}}\right)\frac{1}{N!\,n^{N(1-\alpha)}}.$
Therefore, combining the above estimates, we obtain
$|F|\le\omega_1\!\left(f^{(N)},\frac{1}{n^{1-\alpha}}\right)\frac{1}{N!\,n^{N(1-\alpha)}}.$
Finally, combining this estimate for $|F|$ with the earlier bound on $|A_n(f)(x)-f(x)|$, we derive inequality (10). □
Corollary 1
(to Theorem 2). It holds
$\left|A_n(f)(x)-f(x)-\sum_{j=1}^{N}\frac{f^{(j)}(x)}{j!}\,\frac{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}\left(\frac{k}{n}-x\right)^{j}H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}{\sum_{k=\lceil nx-n^{\alpha}\rceil}^{\lfloor nx+n^{\alpha}\rfloor}H\left(n^{1-\alpha}\left|x-\frac{k}{n}\right|\right)}\right|\le\omega_1\!\left(f^{(N)},\frac{1}{n^{1-\alpha}}\right)\frac{1}{N!\,n^{N(1-\alpha)}}.$
Proof. 
This follows from the expansion of $A_n(f)(x)-f(x)$ in the proof of Theorem 2, together with the estimate for $|F|$ obtained there. □
Corollary 2
(to Theorem 1). Let $x\in[-\phi,\phi]$, $\phi>0$, and let $n\in\mathbb{N}$ be such that $n\ge 1+\phi$, $0<\alpha<1$. Consider $p\ge 1$. Then
$\|A_n(f)-f\|_{p,[-\phi,\phi]}\le\omega_1\!\left(f,\frac{1}{n^{1-\alpha}}\right)2^{\frac{1}{p}}\,\phi^{\frac{1}{p}}.\quad$ (24)
By (24), we derive the $L_p$ convergence of $A_n(f)$ to $f$ with rates, given that $f$ is uniformly continuous.
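For completeness, here is a short derivation of (24): by Theorem 1, applied at every $x\in[-\phi,\phi]$ (note that $n\ge 1+\phi\ge 1+|x|$),
$\|A_n(f)-f\|_{p,[-\phi,\phi]}^{p}=\int_{-\phi}^{\phi}|A_n(f)(x)-f(x)|^{p}\,dx\le\omega_1\!\left(f,\frac{1}{n^{1-\alpha}}\right)^{p}2\phi,$
and taking $p$-th roots gives (24).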
We finish with
Corollary 3
(to Theorem 2). Under the assumptions of Theorem 2 and Corollary 2, we have
$\|A_n(f)-f\|_{p,[-\phi,\phi]}\le\sum_{j=1}^{N}\frac{\|f^{(j)}\|_{p,[-\phi,\phi]}}{n^{j(1-\alpha)}\,j!}+\omega_1\!\left(f^{(N)},\frac{1}{n^{1-\alpha}}\right)\frac{2^{\frac{1}{p}}\,\phi^{\frac{1}{p}}}{N!\,n^{N(1-\alpha)}},\quad N\in\mathbb{N}.\quad$ (25)
By (25), we again derive the $L_p$ convergence of $A_n(f)$ to $f$ with rates.
Proof. 
Inequality (25) follows by integrating (10) over $[-\phi,\phi]$ and using the triangle inequality and the homogeneity properties of the $L_p$ norm. □

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Cardaliaguet, P.; Euvrard, G. Approximation of a function and its derivative with a neural network. Neural Netw. 1992, 5, 207–220.
  2. Anastassiou, G.A. Rate of convergence of some neural network operators to the unit—univariate case. J. Math. Anal. Appl. 1997, 212, 237–262.
  3. Anastassiou, G.A. Intelligent Systems II: Complete Approximation by Neural Network Operators; Springer: Heidelberg, Germany; New York, NY, USA, 2016.
  4. Chen, Z.; Cao, F. The approximation operators with sigmoidal functions. Comput. Math. Appl. 2009, 58, 758–765.
  5. Costarelli, D.; Spigler, R. Approximation results for neural network operators activated by sigmoidal functions. Neural Netw. 2013, 44, 101–106.
  6. Costarelli, D.; Spigler, R. Multivariate neural network operators with sigmoidal activation functions. Neural Netw. 2013, 48, 72–77.
  7. Haykin, S. Neural Networks: A Comprehensive Foundation, 2nd ed.; Prentice Hall: New York, NY, USA, 1998.
  8. McCulloch, W.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943, 5, 115–133.
  9. Mitchell, T.M. Machine Learning; WCB-McGraw-Hill: New York, NY, USA, 1997.
  10. Yu, D.S.; Cao, F.L. Construction and approximation rate for feed-forward neural network operators with sigmoidal functions. J. Comput. Appl. Math. 2025, 453, 116150.
  11. Cen, S.; Jin, B.; Quan, Q.; Zhou, Z. Hybrid neural-network FEM approximation of diffusion coefficient in elliptic and parabolic problems. IMA J. Numer. Anal. 2024, 44, 3059–3093.
  12. Coroianu, L.; Costarelli, D.; Natale, M.; Pantiş, A. The approximation capabilities of Durrmeyer-type neural network operators. J. Appl. Math. Comput. 2024, 70, 4581–4599.
  13. Warin, X. The GroupMax neural network approximation of convex functions. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 11608–11612.