2. Some Basic Notations
Let ℕ denote the set of all the natural numbers and ℝ the set of all the real numbers; we also use the extended real numbers, i.e., the real numbers together with the infinities. For a real value x, ⌊x⌋ and ⌈x⌉ denote the floor and ceiling functions evaluated at x, respectively. For convenience, we also use the set of all the non-negative real numbers. For any arbitrary natural number n, the set {1, 2, …, n} serves as an index set. Cartesian products are written in the usual way, and the n-ary Cartesian product denotes the corresponding set of ordered n-tuples. If S is a set, we use |S| to denote the size (cardinality) of the set S. A Cartesian product of index sets is also regarded as a column vector of indices; for a vector, we refer to its mth element and to its length (the number of its elements) in the usual way. Suppose A is an arbitrary k-dimensional tensor obtained after all the preliminary processing of the original object (for example, the padding of an input text or image); equivalently, A could be identified with a function on its index set. A stride vector records the number of strides taken in the ith dimension at the jth convolutional layer; a sectional tensor represents the ith feature/filter in the jth layer; and we also keep track of the number of filters/features in each layer j.
3. Basic Definitions and Properties
Definition 1. We use to denote a tensor with sectional/directional vector , i.e., , where is the number of sections (or directions) of . If , we use to denote its partial function, i.e., . Furthermore, .
Remark 1. In most cases, kernels are representable by a product set (or tensor product). In some cases, however, we might need to consider irregular kernels, i.e., those that cannot be represented directly by product sets. If that is the case, in our setting, we could simply pad the undefined cells with 0; for example, an irregular kernel can be extended to its smallest enclosing product set with the undefined cells set to 0. Such a device spares us from redesigning the setting/modeling from scratch. This leads to another question: why, or under which circumstances, would we tend to adopt irregular kernels? An obvious circumstance is when even the input data are not representable by a product set or a tensor. If this is the case, we could apply the same technique and pad the undefined cells with 0. Such an accommodation suits our setting and is consistent with the settings of standard CNNs.
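As a concrete illustration of this zero-padding, the following minimal R sketch (the kernel's support and values are a toy example of our own) embeds an irregularly supported kernel into its smallest enclosing rectangular matrix and fills the undefined cells with 0.

# Toy irregular kernel: values defined only on some (row, col) positions.
# (These positions and values are illustrative assumptions.)
irregular <- data.frame(
  row = c(1, 1, 2, 3),
  col = c(1, 2, 2, 3),
  val = c(0.5, -1.0, 2.0, 0.25)
)

# Pad into the smallest enclosing product set (a full matrix), with 0 elsewhere.
padded <- matrix(0, nrow = max(irregular$row), ncol = max(irregular$col))
padded[cbind(irregular$row, irregular$col)] <- irregular$val

print(padded)
# The padded kernel can now be used exactly like a regular (rectangular) kernel.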
Definition 2. Let denote a stride vector whose elements indicate the strides with respect to each section (or direction).
Definition 3. Let denote a filter (tensor), or kernel, where is its sectional vector, i.e., . If we need to denote the ith filter in the jth layer, we use the notation .
Remark 2. For each kernel (the ith kernel in the jth layer), we could associate a stride vector with it. This is based on the assumption that a uniform stride is applicable and appropriate for the sliding and inner product operation. In some cases, we might encounter non-uniform strides [23]. A much more generalized setting is one where the strides (or the positions at which the inner product takes place) are determined by a set of positional vectors (for example, such positions are randomly chosen). Such a case is beyond our setting, since our setting is based on the standard CNN, in which the strides are normally assumed to be uniform. In order not to reformulate our setting, we could add another masking layer after the feature maps [24], so that all the stride vectors remain uniform. Each feature map then acts on a mask whose entries are 1 where the feature map is activated and 0 where it is not. After this masking layer, in the pooling layer, one gets rid of all the values of 0 and their associated positions. The remaining values are linearized and fed into the ANN part.
Remark 3. Another interesting aspect of feature extraction is the dilated convolutional neural network [25,26]. Since the dilated parts are normally padded with 0, this setting is still comparable with ours as far as kernels are concerned, because the parameters to be learned lie in the kernels. To accommodate a dilated kernel in our setting, one fixes the padded parameters and learns the other, non-padded parts of the kernel.
In order to present the cropped sub-tensors of the input, given the filter and the stride vector, we use a power tensor to collect such sub-tensors as follows:
Definition 4 (power tensor one). Let , where , where ⊙ indicates the Hadamard product and .
Remark 4. In the definition, one observes that the argument of the floor function might contain a denominator of 0; this is normally handled by the extended real numbers. As long as the numerator is non-negative, we could regard the evaluated value of the floor function as 0. As for the numerator being non-negative, this normally holds once padding is added to the CNN system. The same argument applies to the following definitions containing a denominator of 0.
Definition 5 (power tensor two). Let , where .
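To make Definitions 4 and 5 concrete, the following R sketch crops all the sub-tensors of a matrix (the 2-dimensional case) for a given window size and stride vector. The specific sizes and strides below are illustrative assumptions of ours, and the count of sub-tensors follows the usual floor-based formula (our reading of Definition 4).

# Crop all sub-matrices of T of size 'ksize' taken with strides 'stride'
# (2-dimensional instance of the "power tensor" construction).
power_tensor_2d <- function(T, ksize, stride) {
  n_row <- floor((nrow(T) - ksize[1]) / stride[1]) + 1
  n_col <- floor((ncol(T) - ksize[2]) / stride[2]) + 1
  subs <- vector("list", n_row * n_col)
  k <- 1
  for (i in seq_len(n_row)) {
    for (j in seq_len(n_col)) {
      r0 <- (i - 1) * stride[1] + 1
      c0 <- (j - 1) * stride[2] + 1
      subs[[k]] <- T[r0:(r0 + ksize[1] - 1), c0:(c0 + ksize[2] - 1)]
      k <- k + 1
    }
  }
  subs
}

# Illustrative example: a 6-by-7 matrix, a 2-by-3 window, strides (2, 2).
T <- matrix(seq_len(42), nrow = 6, ncol = 7)
subs <- power_tensor_2d(T, ksize = c(2, 3), stride = c(2, 2))
length(subs)  # floor((6-2)/2)+1 times floor((7-3)/2)+1 = 3 * 3 = 9 sub-matrices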
Definition 6 (maximum pooling function). Define the maximum pooling function p so that it returns the largest entry of each (sub-)tensor it acts on.
For a standard CNN, one could consult Figure 1.
Figure 1. The figure shows the training of a complete data set. Suppose there are B training data points. The process consists mainly of three operators: the convolution, the activation function a, and the pooling function p. The convolution operator (or inner product) acts on the input tensor and the filters/features; each filter is indexed by its own dimensional vector and by n (the nth filter at the mth recursive step for the jth input). The activation function is denoted by a, while the pooling function is denoted by p. In the whole process, we insert an auxiliary operator (named a power tensor operator), which is specified by the strides assigned to all the dimensions and by the dimensional vector of a truncated tensor (the tensor used to truncate its preceding tensor). The floor function is used to calculate the dimensional vectors of the truncated tensors. Finally, the resulting (column) vector comes from flattening and stacking all the pooled tensors at the mth recursive step for the jth input. In this diagram, we do not directly take padding into consideration; it could be accommodated by a slight tweak of the setting. In addition, N is the total number of recursive steps for the backward induction, which we suppose is fixed.
A high-dimensional tensor is fed into the CNN system. Its cropped sub-tensors interact with the sequence of filters/kernels/tensors to yield a sequence of feature maps/inner products. After applying the activation function a to the feature maps, one forms their sub-tensors via a sequence of power tensors. Next, one applies the pooling function p to these sub-tensors. Lastly, one flattens and stacks the pooled tensors, with respect to the recursive step and the filter index n, into one column vector via the flattening map f and the linearization. This could also be represented by the diagram presented in Figure 2.
In order to perform the backward propagation/pass/induction, we take the max pooling function p and keep track of all the indices at which the maxima are attained, for each fixed filter index n.
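A minimal R sketch of this bookkeeping is given below (the pooling window and its values are toy choices of our own): the max pooling step returns both the pooled value and the index at which the maximum is attained, which is exactly the information needed for the backward pass.

# Max pooling over one cropped sub-tensor, keeping track of the winning index.
max_pool_with_index <- function(sub_tensor) {
  idx <- which.max(sub_tensor)                   # linear index of the maximum
  list(value = sub_tensor[idx],
       index = arrayInd(idx, dim(sub_tensor)))   # multi-dimensional position
}

# Illustrative 2-by-3 sub-tensor (values are arbitrary).
sub <- matrix(c(0.1, 0.7, -0.3, 0.2, 0.9, 0.4), nrow = 2)
pooled <- max_pool_with_index(sub)
pooled$value   # 0.9
pooled$index   # the (row, column) position at which the maximum is attained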
Example 1. Suppose T is a 6-by-7 matrix (tensor). Suppose the stride vector and the filter are given. Then, the corresponding window counts follow from the floor-based formula. The convolution between T and some feature matrix is shown in Figure 3. As for the procedure of linearizing the power tensor, one starts by linearizing each sub-tensor. Hence, the total number of neurons in this layer follows accordingly; i.e., the power tensor is stacked by a sequence of column vectors of the corresponding lengths.
4. Theoretical Settings
Definition 7 (). For any column vector and , define and .
Definition 8. Define the positional window/vector in the ith dimension by .
Claim 1 (dimensional multipliers). If , and the stride vector is , then , i.e., .
Proof. Based on the concept of an arithmetic sequence and the given conditions, for each dimension there exists a maximum index such that the corresponding window still lies inside the tensor. Since the admissible positions form an arithmetic sequence, the result follows immediately via the representation of the floor function. □
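Reading Claim 1 as stating that the number of admissible window positions along dimension i equals ⌊(τ_i − φ_i)/σ_i⌋ + 1 (with τ the input sectional vector, φ the filter sectional vector, and σ the stride vector; these symbol names are our own choice), the following short R check compares the formula against a brute-force enumeration.

# Dimensional multiplier along one dimension:
# number of admissible window positions for input size tau,
# window size phi and stride sigma.
dim_multiplier <- function(tau, phi, sigma) floor((tau - phi) / sigma) + 1

# Brute-force count of positions p with p + phi - 1 <= tau, p = 1, 1 + sigma, ...
brute_force <- function(tau, phi, sigma) sum(seq(1, tau, by = sigma) + phi - 1 <= tau)

# Check agreement on a few illustrative cases.
cases <- expand.grid(tau = c(6, 7, 200), phi = c(2, 3, 5), sigma = c(1, 2, 3))
all(mapply(dim_multiplier, cases$tau, cases$phi, cases$sigma) ==
    mapply(brute_force,   cases$tau, cases$phi, cases$sigma))  # TRUE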
Definition 9. Let denote the column vector .
Definition 10. The set of all the positional indices of (or ) is defined by
Lemma 1 (-index tensor). The -index tensor of T is (a partial function), where the partial domain , where .
Proof. Based on Definitions 7–10 and Claim 1, the results follow immediately. □
Claim 2 (linearity for matrix). The function defined by is a bijective function for any given .
Proof. Since , it suffices to show that l is injective. Suppose , or , i.e., . Since , i.e., , i.e., , one has and thus . □
Lemma 2 (linearity of spatial indices of a tensor). The function defined by is a bijective function.
Proof. We show this by mathematical induction. For the base case, it is shown to be true in Claim 2. Suppose the claim holds at a given stage, i.e., for any arbitrary pair of indices with equal linearized values, the indices coincide. Next, we show the claim also holds at the next stage. Suppose two indices have the same linearized value. Comparing the two expressions modulo the size of the last section shows that the last coordinates agree; dividing out this common part, the remaining linearized values agree as well, and, based on the induction hypothesis, so do all the remaining coordinates. □
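The following R sketch gives one concrete instance of such a linearization and its inverse (we use a row-major-style mixed-radix ordering purely for illustration; the paper's own ordering is the one fixed by its definition of l), and verifies bijectivity on a small index grid.

# Linearize a multi-index 'idx' over a grid with sectional vector 'dims'
# (row-major-style mixed-radix encoding), together with its inverse.
linearize <- function(idx, dims) {
  l <- 0
  for (d in seq_along(dims)) l <- l * dims[d] + (idx[d] - 1)
  l + 1
}
delinearize <- function(l, dims) {
  l <- l - 1
  idx <- integer(length(dims))
  for (d in rev(seq_along(dims))) {
    idx[d] <- (l %% dims[d]) + 1
    l <- l %/% dims[d]
  }
  idx
}

# Verify bijectivity on a small 3-by-4-by-2 grid.
dims <- c(3, 4, 2)
grid <- as.matrix(expand.grid(lapply(dims, seq_len)))
lin  <- apply(grid, 1, linearize, dims = dims)
stopifnot(!anyDuplicated(lin), sort(lin) == seq_len(prod(dims)))
stopifnot(all(apply(grid, 1, function(ix) all(delinearize(linearize(ix, dims), dims) == ix))))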
Lemma 2 is vital in converting a tensor-style representation of a CNN into a neuron-style representation of a CNN, as demonstrated in Corollaries 1–6.
Corollary 1 (vectorization/linearization of the indices of an input tensor ). is a bijective function if is defined in the same manner as l in Lemma 2.
Corollary 2. (vectorization/linearization of the indices of sub-tensors or power tensors of ) is a bijective function if is defined in the same manner as l in Lemma 2.
Proof. If we take and , then the result follows immediately. □
Corollary 3 (vectorization/linearization of the number of convolutional operations, or sums of products sop, between a power tensor and all the kernels in a given layer). The map is a bijective function if it is defined in the manner of l in Lemma 2, where the corresponding parameter is the number of filters (see Figure 4).
Corollary 4 (vectorization/linearization of the indices of a kernel/filter). The map is a bijective function if it is defined in the same manner as l in Lemma 2.
Corollary 5. By applying the floor and ceiling functions stage by stage, the dimensional vectors of the successive (truncated) tensors are obtained from those of the preceding stage; the whole process is repeated until the final stage is reached.
Remark 5. Since the above maps are defined over the sizes of the sections (or sectional vectors), any further expansion, such as zero-padding of T (the input object with sectional vector τ and a zero-padding sectional vector), will not alter their bijectivity, i.e., the maps are all still bijective (this could easily be shown via Lemma 2).
Corollary 6. For the ith filter in the layer (stage) j, or , the filter could be linearized by for all , i.e., in the jth layer (stage) related to the feature maps, there are linearized nodes whose values are decided by the presented results.
Proof. For each , , by Lemma 2, the result follows immediately. □
Example 2. Suppose T is a 6-by-7 matrix. Suppose the stride vector . . Then, , . Based on Claim 1, , . Suppose . Based on Lemma 1, and thus . Hence, could be represented by .
Example 3. Suppose T is a 26-by-37-by-19-by-28 tensor. Suppose the stride vector . . Then, , . Based on Claim 1, , . Suppose . Based on Lemma 1, and thus . Hence, could be represented by , where .
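For a tensor of the size given in Example 3, a quick R check confirms that a spatial index and its linearized counterpart determine each other uniquely. Only the dimensions 26 × 37 × 19 × 28 come from the example; the probed position is our own choice, and we simply reuse R's native column-major linearization, which differs from the ordering of Lemma 2 only by a relabelling of the dimensions.

# Dimensions taken from Example 3; the probed position is an arbitrary choice.
dims <- c(26, 37, 19, 28)
pos  <- c(5, 12, 7, 20)

A <- array(seq_len(prod(dims)), dim = dims)   # entries equal their own linear index

lin <- A[pos[1], pos[2], pos[3], pos[4]]      # linearized index of 'pos'
lin
arrayInd(lin, dims)                           # recovers (5, 12, 7, 20): the map is invertible
prod(dims)                                    # 511,784 admissible positions in total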
Definition 11. , or a tensor consisting of the standard inner products between input tensors and the feature tensor at stage (layer) j.
Now, each of these inner products is captured by the function sum of products, as we did for the fully connected network. With Definition 11 and Corollary 6, we could reach the conclusion that the total number of neurons in this layer is . Since we are going to look into the forward and backward propagations, we partition the indices of the tensors of the previous layer.
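The bookkeeping in this paragraph can be sketched in R as follows: assuming, as in the discussion above, that each filter contributes one sop neuron per admissible window position, the size of this layer is the sum, over the filters, of the products of the per-dimension window counts. The input size is the 200-by-200 matrix used later in Section 6; the kernel shapes and stride below are our own assumptions, chosen only so that the per-filter counts reproduce the 2535 and 2640 values reported there.

# Number of sop neurons contributed by one filter:
# the product over dimensions of floor((tau_i - phi_i) / sigma_i) + 1.
sop_neurons_per_filter <- function(tau, phi, sigma) {
  prod(floor((tau - phi) / sigma) + 1)
}

# Illustrative layer: a 200-by-200 input, four assumed kernel shapes, a fixed stride.
tau     <- c(200, 200)
kernels <- list(c(8, 10), c(5, 5), c(5, 5), c(5, 5))
sigma   <- c(3, 5)

per_filter <- sapply(kernels, sop_neurons_per_filter, tau = tau, sigma = sigma)
per_filter        # 2535 2640 2640 2640
sum(per_filter)   # 10455 sop neurons in total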
5. Theories and Methods
Claim 3. The number of neurons in the first layer equals the size of the input tensor under the linearization in Corollary 1, and the value associated with each neuron k is the kth entry of the linearized input tensor, for all k, where the linearization is the one defined in Corollary 1.
Claim 4. The number of neurons in the second layer is the total number of sums of products taken over all the filters/kernels; i.e., the label set for the neurons in the second layer consists of the sop indices, where sop stands for the 'sum of products' of the jth node in the referred layer (or, in the standard CNN, it is the jth convoluted value).
Lemma 3. The indices from the first layer described in Claim 3 that link to the jth neuron in the second layer described in Claim 4 form the set of linearized positions covered by the sub-tensor (window) associated with j.
Proof. Given an index j in the second layer, based on Corollary 3, one recovers the filter index and the position of the corresponding sub-tensor. Then, based on Corollary 2, one recovers the spatial indices of that sub-tensor, which, via Definition 4 and Corollary 1, links to the result. □
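Under a 2-dimensional toy setting (the input size, window size, stride, and probed window position are our own illustrative choices), the following R sketch lists, for a given sop neuron, the linearized first-layer indices that link to it, which is the content of Lemma 3; the linearization here is R's column-major one.

# For the sop neuron at window position (i_out, j_out), list the linearized
# (column-major) indices of the first-layer neurons that link to it.
linked_indices <- function(input_dim, ksize, stride, i_out, j_out) {
  r0 <- (i_out - 1) * stride[1]          # zero-based offset of the window
  c0 <- (j_out - 1) * stride[2]
  rows <- r0 + seq_len(ksize[1])
  cols <- c0 + seq_len(ksize[2])
  as.vector(outer(rows, (cols - 1) * input_dim[1], `+`))  # (c-1)*nrow + r
}

# Illustrative case: 6-by-7 input, 2-by-3 window, strides (2, 2), window position (2, 3).
linked_indices(c(6, 7), c(2, 3), c(2, 2), i_out = 2, j_out = 3)
# 27 28 33 34 39 40: the six input neurons feeding this sop neuron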
Theorem 1. , where and .
Proof. Based on Corollary 3, the result follows immediately. □
Theorem 2. The weight between the ith neuron in the first layer and the jth neuron in the second layer is given by the weight function constructed in the proof below, for all admissible i and j; the filters involved are those defined in Definition 3.
Proof. Let us define a (weight) function between the two layers. Before doing so, let us fix some settings. There are finitely many sub-tensors and, based on the proof of Lemma 3, the indices belonging to each sub-tensor (whose size is that of the filter) are known. Now, observe that an index j of the second layer is decided by a pair of values: the filter index and the position of the corresponding sub-tensor. Since a neuron i in the first layer could link to any node in the next layer (the intermediate output layer, or sop layer), the value of the weight needs to involve j as well. Then, we have to check whether i is activated (convoluted) with respect to j, or, more precisely, whether i is located in the sub-tensor associated with j. In summary, neuron i is activated (convoluted) if it lies in that sub-tensor, and its activated value is defined to be the corresponding entry of the filter/kernel, where the entry is decided by the linearized position of i within the filter. Hence, the weight function is defined as in the statement. The whole argument could also be followed in Figures 4 and 5. □
Remark 6. W is the set of weights between an input layer, whose input object/tensor is the (linearized) input with its corresponding number of neurons, and an intermediate output layer, whose intermediate output values are the sums of products (sop) with their corresponding number of neurons. The entry of W indexed by (i, j) is the weight between the ith neuron in the input layer and the jth neuron in the intermediate output layer.
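To make the role of W concrete, here is a small R sketch (a 2-dimensional toy case with sizes of our own choosing) that builds the weight matrix between the linearized input layer and the sop layer for a single kernel: column j of W carries the kernel entries at the input positions covered by the jth window and zeros elsewhere, with R's column-major linearization playing the role of l.

# Build the input-to-sop weight matrix W for one kernel (2D case).
# Column j of W holds the kernel entries at the input cells of window j, 0 elsewhere.
build_W <- function(input_dim, kernel, stride) {
  n_row <- floor((input_dim[1] - nrow(kernel)) / stride[1]) + 1
  n_col <- floor((input_dim[2] - ncol(kernel)) / stride[2]) + 1
  W <- matrix(0, nrow = prod(input_dim), ncol = n_row * n_col)
  j <- 1
  for (cc in seq_len(n_col)) {        # windows enumerated column by column
    for (rr in seq_len(n_row)) {
      r0 <- (rr - 1) * stride[1]
      c0 <- (cc - 1) * stride[2]
      for (kc in seq_len(ncol(kernel))) {
        for (kr in seq_len(nrow(kernel))) {
          # column-major linearized index of input cell (r0 + kr, c0 + kc)
          i <- (c0 + kc - 1) * input_dim[1] + (r0 + kr)
          W[i, j] <- kernel[kr, kc]
        }
      }
      j <- j + 1
    }
  }
  W
}

# Toy example: 5-by-6 input, 2-by-2 kernel, stride (1, 2).
W <- build_W(c(5, 6), matrix(c(1, 2, 3, 4), 2, 2), c(1, 2))
dim(W)            # 30 x 12: 30 input neurons, 12 sop neurons
colSums(W != 0)   # each sop neuron is linked to exactly 4 input neurons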
Theorem 3. The CNN is represented by two networks: a partial network (or parameterized fully connected network) for feature extraction and a fully connected network.
Proof. The result follows immediately from Theorems 1 and 2 (see Figure 5). □
6. Illustrative Example
Since the main difference between the standard CNN (sCNN) and this linearized CNN (lCNN) lies in the encoding part, the calculation of the set of weights W is vital. In this section, we exploit R programming (version 4.5.1) to demonstrate the equivalence between the sCNN and the lCNN, i.e., an illustrative example for Theorems 2 and 3 (see https://github.com/raymingchen/linearized-high-dimension-CNN.git, accessed on 1 September 2025). Suppose the input image is a cute puppy (https://unsplash.com/photos/brown-short-coated-dog-on-green-grass-field-dgr0ZDbeOqw, a 3D RGB tensor, accessed on 1 September 2025). To simplify the computation, we extract the R part of the input tensor, or puppyR (a 200-by-200 matrix, shown in the (1,2) cell of Figure 6), as our input tensor. Suppose we consider four kernels (the first two are randomly generated, while the last two are some deterministic features).
Then, we perform the standard CNN (sCNN) and the linearized CNN (lCNN) for the convolution (this operation is sufficient to show whether the two representations are equivalent). The convoluted results based on the sCNN and the lCNN are shown in the 4th and 5th rows of Figure 6. The stride vector is fixed throughout. Based on these, one finds that the numbers of power tensors (or sub-tensors) with respect to the four kernels are 2535, 2640, 2640, and 2640, respectively. For the linearized CNN, there are 200 × 200 = 40,000 nodes in the first/input layer, and there are 2535 + 2640 + 2640 + 2640 = 10,455 nodes in the second layer (the intermediate output, or sop, layer). Then, we construct (based on Theorem 2) the weights linking layer one and layer two as a 40,000-by-10,455 weight matrix W. The values associated with the nodes of the 2nd layer are then computed as a 10,455-by-1 column vector obtained by multiplying the transpose of W (t denotes the transpose operator) by the linearized 40,000-by-1 column vector of puppyR. In order to show the equivalence of the sCNN and the lCNN, we reshape this vector into four matrices: a 65-by-39 matrix and three 66-by-40 matrices. The feature maps of the sCNN and these reshaped matrices of the lCNN are identical under the convolution operation. Since the remaining operations (forward/backward propagation) are also identical regardless of whether one uses the sCNN or the lCNN representation, we have demonstrated the equivalence between the two representations.
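The full demonstration with the puppyR image and the four kernels is in the repository linked above; the following self-contained R sketch replays the same equivalence check on a small random input and a single kernel of our own choosing: the sliding-window (sCNN) feature map and the reshaped product of the transposed weight matrix with the linearized input (lCNN) agree exactly.

set.seed(1)

# --- sCNN side: direct sliding-window convolution (no kernel flipping, as in CNNs) ---
sconv <- function(X, K, stride) {
  n_row <- floor((nrow(X) - nrow(K)) / stride[1]) + 1
  n_col <- floor((ncol(X) - ncol(K)) / stride[2]) + 1
  out <- matrix(0, n_row, n_col)
  for (i in seq_len(n_row)) {
    for (j in seq_len(n_col)) {
      r0 <- (i - 1) * stride[1]
      c0 <- (j - 1) * stride[2]
      out[i, j] <- sum(X[r0 + seq_len(nrow(K)), c0 + seq_len(ncol(K))] * K)
    }
  }
  out
}

# --- lCNN side: weight matrix between the linearized input and the sop layer ---
lconv_W <- function(input_dim, K, stride) {
  n_row <- floor((input_dim[1] - nrow(K)) / stride[1]) + 1
  n_col <- floor((input_dim[2] - ncol(K)) / stride[2]) + 1
  W <- matrix(0, prod(input_dim), n_row * n_col)
  for (j in seq_len(n_row * n_col)) {
    i_out <- (j - 1) %% n_row + 1           # output row (column-major over the map)
    j_out <- (j - 1) %/% n_row + 1          # output column
    r0 <- (i_out - 1) * stride[1]
    c0 <- (j_out - 1) * stride[2]
    for (kc in seq_len(ncol(K))) for (kr in seq_len(nrow(K))) {
      W[(c0 + kc - 1) * input_dim[1] + (r0 + kr), j] <- K[kr, kc]
    }
  }
  W
}

# Small illustrative instance.
X      <- matrix(rnorm(20 * 18), 20, 18)
K      <- matrix(rnorm(3 * 4), 3, 4)
stride <- c(2, 3)

map_s <- sconv(X, K, stride)                              # sCNN feature map
W     <- lconv_W(dim(X), K, stride)
map_l <- matrix(t(W) %*% as.vector(X), nrow(map_s))       # lCNN result, reshaped

all.equal(map_s, map_l)   # TRUE: the two representations coincide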