1. Introduction
Although Shannon's information theory [1] has achieved remarkable success, it faces three significant limitations that restrict its applications to semantic communication and machine learning. First, it cannot measure semantic information. Second, it relies on the distortion function to evaluate communication quality, but the distortion function is subjectively defined and lacks an objective standard. Third, it is challenging to incorporate model parameters into Shannon's entropy and information formulas. In contrast, machine learning often uses cross-entropy and cross-mutual information involving model parameters. Moreover, the minimum distortion criterion resembles the philosophy of "the absence of fault is a virtue", whereas a more desirable principle might be "merit outweighing fault is a virtue". Why did Shannon's information theory use the distortion criterion instead of the information criterion? This is intriguing.
The study of semantic information gained attention soon after Shannon's theory emerged. Weaver initiated research on semantic information and information utility [2], and Carnap and Bar-Hillel proposed a semantic information theory [3]. Thirty years ago, the author of this article generalized Shannon's theory to a semantic information theory [4,5,6]. Now, we call it G theory for short [7]. Earlier contributions to semantic information theories include works by Carnap and Bar-Hillel [3], Dretske [8], Wu [9], and Zhong [10], while contributions after the author's generalization include those by Floridi [11,12] and others [13]. These theories primarily address natural language information and semantic information measures for daily semantic communication, involving the philosophical foundation of semantic information.
In the past decade, research on semantic communication and semantic information theory has developed rapidly in the following two fields. One field is electronic communication. The demand for the sixth-generation high-speed internet has led to the emergence of some new semantic information theories or methods [14,15,16], which pay more attention to electronic semantic communication, especially semantic compression and distortion [17,18,19]. These studies have high practical value. However, they mainly address electronic semantic information without considering how to measure the information in daily semantic communication, such as the information conveyed by a prediction, a GPS pointer, a color perception, or a label. (All the abbreviations with their original texts are listed in the back matter.)
The other field is machine learning. Now, cross-entropy, posterior cross-entropy H(X|Yθ) (roughly speaking, it is Variational Free Energy (VFE) [20,21,22]), semantic similarity, estimating mutual information (MI) [23,24,25], regularization distortion [26], etc., are widely used in the field of machine learning. These information or entropy measures are used to optimize model parameters and latent variables [27,28,29], achieving significant success. However, machine learning researchers rarely talk about "semantic information theory". An important reason is that these authors have not found that estimating MI is a special case of semantic MI, and that source entropy H(X) minus VFE equals semantic MI. So, they also did not propose a general measure of semantic information. Machine learning researchers are still unclear about the relationship between estimating MI and Shannon MI. For example, there is controversy over which type of MI needs to be maximized or minimized [23,30,31,32]. Although significant progress has been made in applying deep learning methods to electronic semantic communication and semantic compression [17,18,33], the theoretical explanation is still unclear, resulting in many different methods.
Thirty years ago, the author extended the information rate–distortion function to obtain the information rate–fidelity function R(G) (where R is the minimum Shannon MI for the given semantic MI, G). The R(G) function has already been applied to image data compression according to visual discrimination [5,7]. Over the past 10 years, the author has incorporated model parameters into the truth function and used it as the learning function [7,32,34], optimizing it with the sample distribution. Machine learning applications include multi-label learning and classification, the maximum MI classification of unseen instances, mixture models [8], Bayesian confirmation [35,36], semantic compression [37], and solving latent variables. Semantic communication urgently requires semantic compression theories like the information rate–distortion theory [38,39,40]. The R(G) function can intuitively display the relationship between Shannon MI and semantic MI and seems to have universal significance.
Due to the above reasons, the first motivation for writing this article is to combine the above three fields to introduce G theory and its applications, demonstrating that different fields can use the same semantic information measures and optimization methods.
On the other hand, researchers hold two extreme viewpoints on semantic information theory. One view argues that Shannon's theory suffices, rendering a dedicated semantic information theory unnecessary; at most, semantic distortion needs consideration. This is a common practice in the field of electronic semantic communication. The opposite viewpoint advocates for a parallel semantic information theory alongside Shannon's framework. Among parallel approaches, some researchers (e.g., Carnap and Bar-Hillel) use only logical probability without statistical probability; others incorporate semantic sources, semantic channels, semantic destinations, and semantic information rate distortion [16].
G theory offers a compromise between these extremes. It fully inherits Shannon's information theory, including its derived theories. Only the semantic channel composed of truth functions is newly added. Based on Davidson's truth-conditional semantics [41], truth functions represent the extensions and semantics of concepts or labels. By leveraging the semantic channel, G theory can
Derive the likelihood function from the truth function and source, enabling semantic probability predictions, thereby quantifying semantic information;
Replace the distortion constraint in Shannon’s theory with semantic constraints, which include semantic distortion, semantic information quantity, and semantic information loss constraints.
The semantic information measure (i.e., the G measure) does not replace Shannon’s information measure but supplants the distortion metric used to evaluate communication quality. Truth functions can be derived from sample distributions using machine learning techniques with the maximum semantic information criterion, addressing the challenges of defining classic distortion functions and optimizing Shannon channels with an information criterion. A key advantage of generalization over reconstruction is that semantic constraint functions can be treated as new and negative distortion functions, allowing the use of existing coding methods without additional coding considerations for electronic semantic communication.
Because of the above reasons, the second motivation for writing this article is to point out that based on Shannon’s information theory, we only need to replace the distortion constraint with the semantic constraints for semantic communication optimization.
G theory has been continuously improved in recent years, and many conclusions are scattered across about ten articles. Therefore, the author wants to provide a comprehensive introduction so that later generations can avoid detours. This is the third motivation for writing this article.
The existing literature reviews of semantic communication theory mainly introduce the progress of electronic semantic communication [14,15,42,43]. However, this article mainly involves daily semantic communication with its philosophical foundation and machine learning. Although this article mainly introduces G theory, it also compares its differences and similarities with other semantic information theories. Electronic semantic communication theory is different from semantic information theory. The former focuses on one task and involves many theories, whereas the latter focuses on a fundamental theory that involves many tasks. They are complementary, but one cannot replace the other.
The novelty of this article also lies in the introduction of the philosophical thoughts behind G theory. In addition to Shannon's idea, G theory integrates Popper's views on semantic information, logical probability, and factual testing [44,45], as well as Fisher's maximum likelihood criterion [46] and Zadeh's fuzzy set theory [47,48].
The primary purposes of this article are to
Introduce G theory and its applications for exchanging ideas with other researchers in different fields related to semantic communication;
Show that G theory can become a fundamental part of future unified semantic information theory.
The main contributions of this article are as follows:
It systematically introduces G theory from a new perspective (replacing distortion constraints with semantic constraints) and points out its connections with Shannon information theory and its differences and similarities with other semantic information theories.
It systematically introduces the applications of G theory in semantic communication, machine learning, Bayesian confirmation, constraint control, and investment portfolios.
It links many concepts and methods in information theory and machine learning for readers to better understand the relationship between semantic information and machine learning.
G theory is also limited, as it is not a complete or perfect semantic information theory. For example, its semantic representation and data compression of complex data cannot keep up with the pace of deep learning.
The remainder of this paper is organized as follows: Section 2 introduces G theory; Section 3 discusses the G measure for electronic semantic communication; Section 4 explores goal-oriented information and information value (in conjunction with portfolio theory); and Section 5 examines G theory's applications to machine learning. The final section provides discussions and conclusions, including comparing G theory with other semantic information theories, exploring the concept of information, and identifying G theory's limitations and areas for further research.
2. From Shannon’s Information Theory to the Semantic Information G Theory
2.1. Semantics and Semantic Probabilistic Predictions
Popper pointed out in his 1934 book, The Logic of Scientific Discovery [44] (p. 102), that the significance of scientific hypotheses lies in their predictive power and that predictions provide information; the smaller the logical probability and the more a hypothesis can withstand testing, the greater the amount of information it provides. Later, he proposed a logical probability axiom system. He emphasized that a hypothesis has two types of probabilities, statistical and logical, at the same time [39] (pp. 252–258). However, he did not establish a probability system that included both. In his book Conjectures and Refutations [45] (p. 294), Popper affirmed more clearly that the value of a scientific theory lies in the information it provides. We can say that Popper was the earliest researcher of semantic information theory.
The semantics of a word or label encompass both its connotation and extension. Connotation refers to an object’s essential attribute, while extension denotes the range of objects the term refers to. For example, the extension of “adult” includes individuals aged 18 and above, while its connotation is “over 18 years old”. Extensions for some concepts, like “adult”, may be explicitly defined by regulations, whereas others, such as “elderly”, “heavy rain”, “excellent grades”, or “hot weather”, are more subjective and evolve through usage. Connotation and extension are interdependent; we can often infer one from the other.
According to Tarski's truth theory [49] and Davidson's truth-conditional semantics [41], a concept's semantics can be represented by a truth function, which reflects the concept's extension. For a crisp set, the truth function acts as the characteristic function of the set. For example, if x is age and y1 is the label of the set {adult}, we denote the truth function as T(y1|x), which is also the characteristic function of the set {adult}.
The truth function serves as the tool for semantic probability predictions (illustrated in Figure 1); the prediction formula is the semantic Bayes' formula given in Section 2.2. If "adult" is changed to "elderly", the crisp set becomes a fuzzy set, so the truth function is equal to the membership function of the fuzzy set, and the formula remains unchanged.
The extension of a sentence can be regarded as a fuzzy range in a high-dimensional space. For example, an instance described by a sentence with a subject–predicate–object structure can be regarded as a point in the Cartesian product of three sets, and the extension of the sentence is a fuzzy subset of that Cartesian product. For example, suppose the subject and the object are two people in the same group, and the predicate can be one of "bully", "help", etc. The extension of "Tom helps Jone" is an element of the three-dimensional space, and the extension of "Tom helps an old man" is a fuzzy subset of the three-dimensional space. The extension of a weather forecast is a subset of the multidimensional space with time, space, rainfall, temperature, wind speed, etc., as coordinates. The extension of a photo or a compressed photo can be regarded as a fuzzy set including all things with similar characteristics.
Floridi affirms that all sentences or labels that may be true or false contain semantics and provide semantic information [12]. The author agrees with this view and suggests converting the distortion function and the truth function T(yj|x) into each other. To this end, we define
T(yj|x) = exp[−d(yj|x)], i.e., d(yj|x) = log[1/T(yj|x)],
where exp and log are a pair of inverse functions, and d(yj|x) means the distortion when yj represents x. We use d(yj|x) instead of d(x, yj) because the distortion may be asymmetrical.
For example, the pointer on a Global Positioning System (GPS) map has relative deviation or distortion; the distortion function can be converted into a truth function or similarity function:
T(yj|x) = exp[−(x − xj)²/(2σ²)],
where σ is the standard deviation; the smaller it is, the higher the precision.
Figure 2 shows the mobile phone positioning seen by someone on a train.
According to the semantics of the GPS pointer, we can predict that the actual position is an approximate normal distribution on the high-speed rail, and the red pentagram indicates the maximum possible position. If a person is on a specific highway, the prior probability distribution, P(x), is different, and the maximum possible position is the place closest to the small circle on the highway.
Clocks, scales, thermometers, and various economic indices are similar to the positioning pointers and can all be regarded as estimates with error ranges or extensions, so they can all be used for semantic probability prediction and provide semantic information. A color perception can also be regarded as an estimate of a color or colored light. The higher the discrimination of the human eye (expressed by a smaller σ), the smaller the extension. A Gaussian function can also be used as a truth function or discrimination function.
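As a small illustration of the above, the following Python sketch (not from the paper; the grid, prior, and σ values are illustrative assumptions) predicts a position from a GPS-pointer-like estimate using a Gaussian truth function and the semantic probability prediction described above.

```python
# A minimal sketch of semantic probability prediction with a Gaussian truth function,
# as in the GPS example; the grid, prior, and sigma values are illustrative assumptions.
import numpy as np

x = np.linspace(0.0, 100.0, 1001)           # possible positions along the track
prior = np.exp(-0.5 * ((x - 40.0) / 20.0) ** 2)
prior /= prior.sum()                        # P(x): prior knowledge of where the train is

def truth_fn(x, x_j, sigma):
    """T(y_j|x) = exp(-(x - x_j)^2 / (2 sigma^2)), a Gaussian truth function."""
    return np.exp(-0.5 * ((x - x_j) / sigma) ** 2)

T = truth_fn(x, x_j=55.0, sigma=5.0)        # GPS pointer says "about x = 55"
T_theta = np.sum(prior * T)                 # logical probability: sum_x P(x) T(y_j|x)
posterior = prior * T / T_theta             # semantic Bayes prediction P(x | y_j is true)

print("logical probability:", round(float(T_theta), 4))
print("most likely position:", float(x[np.argmax(posterior)]))
```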
2.2. The P-T Probability Framework
Carnap and Bar-Hillel only use logical probability. In the above semantic probability prediction, we use the statistical probability, P(x); the logical probability; and the truth value, T(y1|x), which can be regarded as the conditional logical probability. G theory is based on the P-T probability framework, which is composed of Shannon's probability framework, Kolmogorov's probability space [50], and Zadeh's fuzzy sets [47,48].
Next, we introduce the P-T probability framework by its construction process.
Step 1: We use Shannon's probability framework (see the left part of Figure 3). The two random variables, X and Y, take values from the two domains U = {x1, x2, …} and V = {y1, y2, …}, respectively. The probability is the limit of the frequency, as defined by Mises [51]; we call it statistical probability. The statistical probability is defined with "=", such as P(yj|xi) = P(Y = yj|X = xi). Shannon's probability framework can be represented by a triple (U, V, P).
Step 2: We use the domain U to define Kolmogorov's probability of a set (see the right side of Figure 3). Kolmogorov's probability space can be represented by a triple (U, B, P), where B is the Borel field on U, and its element, θj, is a subset of U. If we only consider a discrete U, then B is the power set of U. The probability, P, is the probability of a subset of U. It is defined with "∈", that is, P(θj) = P(X ∈ θj). To distinguish it from the probability of elements, we use T to denote the probability of a set, so the triple becomes (U, B, T).
Step 3: Let yj be the label of θj, that is, define that there is a bijection between B and V (one-to-one correspondence between their elements), and yj is the image of θj. Hence, we obtain the quintuple (U, V, P, B, T).
Step 4: We generalize θ1, θ2, … to fuzzy sets to obtain the P-T probability framework, which is represented by the same quintuple (U, V, P, B, T).
Because yj = "X is in θj", T(θj) equals P(X ∈ θj) = P(yj is true) (according to Tarski's truth theory [49]). So, T(θj) equals the logical probability of yj; that is, T(yj) ≡ T(θj).
Given X = xi, the conditional logical probability of yj becomes the truth value, T(θj|xi), of the proposition yj(xi), and the truth function T(θj|x) is also the membership function of θj.
We also treat θj as the model parameter in the parameterized truth function. This makes it easier to establish the connection between the truth function and the likelihood function. According to Davidson's truth-conditional semantics [41], T(θj|x) reflects the semantics of yj.
The only problem with the P-T probability framework is that the fuzzy set algebra on B may not follow Boolean algebra operations. The author used fuzzy quasi-Boolean algebra [52] to establish the mathematical model of the color vision mechanism, which can solve this problem to a certain extent. Because, in general, we do not need complex fuzzy set functions f(θ1, θ2, …), this problem can be temporarily ignored.
According to the above definition, yj has two probabilities: P(yj), meaning how often yj is selected, and T(θj), meaning how true yj is. The two are generally not equal. The logical probability of a tautology is one, while its statistical probability is close to zero. We have P(y1) + P(y2) + … + P(yn) = 1, but it is possible that T(y1) + T(y2) + … + T(yn) > 1. For example, the age labels include “adult”, “non-adult”, “child”, “youth”, “elderly”, etc., and the sum of their statistical probabilities is one. In contrast, the sum of their logical probabilities is greater than one because the sum of the logical probabilities of “adult” and “non-adult” alone is equal to one.
According to the above definition, we have
T(θj) = ∑i P(xi)T(θj|xi).
This is the probability of a fuzzy event, as defined by Zadeh [48].
We can put T(θj|x) and P(x) into Bayes' formula to obtain the semantic probability prediction formula:
P(x|θj) = T(θj|x)P(x)/T(θj), with T(θj) = ∑i T(θj|xi)P(xi).
The θj in P(x|θj) plays the role of the θ in the popular likelihood function P(x|yj, θ); we use P(x|θj) here because θj is bound to yj. We call the above formula the semantic Bayes' formula.
Because the maximum value of T(yj|x) is 1, from P(x) and P(x|θj), we can derive a new formula:
T(θj|x) = [P(x|θj)/P(x)] / max[P(x|θj)/P(x)].
2.3. Semantic Channel and Semantic Communication Model
Shannon calls P(X), P(Y|X), and P(Y) the source, the channel, and the destination. Just as a set of transition probability functions P(yj|x) (j = 1, 2, …) constitutes a Shannon channel, a set of truth functions T(θj|x) (j = 1, 2, …) constitutes a semantic channel. The comparison of the two channels is shown in Figure 3. For convenience, we also call P(x), P(y|x), and P(y) the source, the channel, and the destination, and we call T(y|x) the semantic channel.
The semantic channel reflects the semantics or extensions of labels, while the Shannon channel indicates the usage of labels. The comparison between the Shannon and the semantic communication models is shown in Figure 4. The distortion constraint is usually not drawn, but it actually exists.
The semantic channel contains information about the distortion function, and the semantic information represents the communication quality, so there is no need to define a distortion function anymore. The purpose of optimizing the model parameters is to make the semantic channel match the Shannon channel, that is, T(θj|x)∝P(yj|x) or P(x|θj) = P(x|yj) (j = 1, 2, …), so that the semantic MI reaches its maximum value and is equal to the Shannon MI. Conversely, when the Shannon channel matches the semantic channel, the information difference reaches the minimum, or the information efficiency reaches the maximum.
2.4. Generalizing Shannon’s Information Measure to the Semantic Information G Measure
Shannon MI can be expressed as
I(X; Y) = H(X) − H(X|Y) = ∑j ∑i P(xi, yj) log[P(xi|yj)/P(xi)],
where H(X) is the entropy of X, reflecting the minimum average code length, and H(X|Y) is the posterior entropy of X, reflecting the minimum average code length after predicting x based on y. Therefore, the Shannon MI means the average code length saved due to the prediction P(x|y).
Replacing P(xi|yj) on the right side of the log with the likelihood function P(xi|θj), we obtain the following semantic MI:
I(X; Yθ) = H(X) − H(X|Yθ) = ∑j ∑i P(xi, yj) log[P(xi|θj)/P(xi)],
where H(X|Yθ) is the semantic posterior entropy of X:
H(X|Yθ) = −∑j ∑i P(xi, yj) log P(xi|θj).
Roughly speaking, H(X|Yθ) is the free energy, F, in the Variational Bayes method (VB) [27,28] and the minimum free energy (MFE) principle [20,21,22]. The smaller it is, the greater the amount of semantic information. H(Yθ|X) is called the fuzzy entropy; according to Equation (2), it equals the average distortion, d̄:
H(Yθ|X) = −∑j ∑i P(xi, yj) log T(θj|xi) = ∑j ∑i P(xi, yj) d(yj|xi) = d̄.
H(Yθ) is the semantic entropy:
H(Yθ) = −∑j P(yj) log T(θj).
Note that P(x|yj) on the left side of the log is used for averaging and represents the sample distribution. It can be a relative frequency and may not be smooth or continuous. P(x|θj) and P(x|yj) may differ, indicating that obtaining information needs factual testing. It is easy to see that the maximum semantic MI criterion is equivalent to the maximum likelihood criterion and similar to the Regularized Least Squares (RLS) criterion. Semantic entropy is the regularization term. Fuzzy entropy is a more general average distortion than the average square error.
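The following toy sketch (an illustration, not the author's code; the joint distribution and truth values are illustrative assumptions) computes the Shannon MI and the semantic MI for a small discrete example and shows that the semantic MI does not exceed the Shannon MI.

```python
# Compare Shannon MI with semantic MI for a discrete joint distribution P(x, y)
# and a semantic channel T(theta_j|x); all numbers are illustrative assumptions.
import numpy as np

P_xy = np.array([[0.30, 0.05],
                 [0.10, 0.15],
                 [0.05, 0.35]])             # rows: x_i, columns: y_j
P_x = P_xy.sum(axis=1)
P_y = P_xy.sum(axis=0)

T = np.array([[1.0, 0.1],
              [0.6, 0.5],
              [0.1, 1.0]])                  # T(theta_j|x_i); the maximum of each column is 1
T_theta = P_x @ T                           # logical probabilities T(theta_j)
P_x_given_theta = (P_x[:, None] * T) / T_theta   # semantic Bayes prediction P(x|theta_j)

shannon_mi = np.sum(P_xy * np.log2(P_xy / np.outer(P_x, P_y)))
semantic_mi = np.sum(P_xy * np.log2(P_x_given_theta / P_x[:, None]))

print(f"I(X;Y)       = {shannon_mi:.4f} bits")
print(f"I(X;Y_theta) = {semantic_mi:.4f} bits  (never larger than the Shannon MI)")
```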
Semantic entropy has a clear coding meaning. Assume that the sets θ1, θ2, … are crisp sets; the distortion function is then d(yj|x) = log[1/T(θj|x)], which is 0 when x ∈ θj and infinite otherwise.
If we regard P(Y) as the source and P(X) as the destination, then, from the parameter solution of the information rate–distortion function [38], it can be seen that the minimum Shannon MI is equal to the semantic entropy, that is, R(D = 0) = H(Yθ).
The following formula indicates the relationship between the Shannon MI and the semantic MI and the encoding significance of the semantic MI:
I(X; Yθ) = I(X; Y) − ∑j P(yj) KL(P(x|yj)‖P(x|θj)),
where KL(·‖·) is the Kullback–Leibler (KL) divergence [53] with a likelihood function, which Akaike [54] first used to prove that the minimum KL divergence criterion is equivalent to the maximum likelihood criterion. The last term in the above formula is never less than 0, reflecting the average code length of the residual coding. Therefore, the semantic MI is less than or equal to the Shannon MI; it indicates the lower limit of the average code length saved due to semantic prediction.
From the above formula, the semantic MI reaches its maximum value when the semantic channel matches the Shannon channel. According to Equation (15), by letting P(x|θj) = P(x|yj), we can obtain the optimized truth function from the sample distribution:
T*(θj|x) = [P(x|yj)/P(x)] / max[P(x|yj)/P(x)].
When Y = yj, the semantic MI becomes the semantic KL information or semantic side information:
I(X; θj) = ∑i P(xi|yj) log[P(xi|θj)/P(xi)].
The KL divergence cannot usually be interpreted as information because the smaller it is, the better. But the I(X; θj) above can be regarded as information because the larger it is, the better.
Solving T*(θj|x) with Equation (16) requires that the sample distributions, P(x) and P(x|yj), be continuous and smooth. Otherwise, by using Equation (17), we can obtain
T*(θj|x) = P(yj|x)/max[P(yj|x)].
The above method for solving T*(θj|x) is called Logical Bayes' Inference (LBI) [7] and can be called the random point falling shadow method. This method inherits Wang's idea of the random set falling shadow [55,56].
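Below is a minimal sketch of the LBI normalization, assuming the proportional form T*(θj|x) = P(yj|x)/max[P(yj|x)] given above; the sample channel P(yj|x) is an illustrative assumption.

```python
# Logical Bayes' Inference (LBI) sketch: the optimized truth function is the transition
# probability function normalized by its maximum over x; the sample channel is assumed.
import numpy as np

P_y_given_x = np.array([[0.9, 0.1],
                        [0.5, 0.5],
                        [0.2, 0.8]])        # P(y_j|x_i) estimated from a sample
T_star = P_y_given_x / P_y_given_x.max(axis=0, keepdims=True)   # T*(theta_j|x)
print(T_star)
```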
Suppose the truth function in (10) becomes a similarity function. In that case, the semantic MI becomes the estimated MI [32], which has been used by deep learning researchers for Mutual Information Neural Estimation (MINE) [23] and Information Noise-Contrastive Estimation (InfoNCE) [24].
In the semantic KL information formula, when X = xi, I(X; θj) becomes the semantic information between a single instance xi and a single label yj:
I(xi; θj) = log[T(θj|xi)/T(θj)] = log[P(xi|θj)/P(xi)].
I(xi; θj) is called the G measure. This measure reflects Popper's idea about factual testing.
Figure 5 illustrates the above formula. It shows that the smaller the logical probability, the greater the absolute value of the information; the greater the deviation, the smaller the information; wrong hypotheses convey negative information.
Bringing Equation (2) into (19), we have
I(xi; θj) = log[1/T(θj)] − d(yj|xi),
which means that I(xi; θj) equals Carnap and Bar-Hillel's semantic information minus the distortion.
2.5. From the Information Rate–Distortion Function to the Information Rate–Fidelity Function
Shannon defines that, given a source, P(x), a distortion function, d(x, y), and the upper limit, D, of the average distortion, d̄, we change the channel, P(y|x), to find the minimum MI, R(D). R(D) is the information rate–distortion function, which can guide us in using Shannon information economically.
Now, we use d(yj|x) = log[1/T(θj|x)] as the asymmetrical distortion function, meaning the distortion when yj is used as the label of x. We replace d(yj|xi) with I(xi; θj), replace d̄ with I(X; Yθ), and replace D with the lower limit, G, of the semantic MI to find the minimum Shannon MI, R(G). R(G) is the information rate–fidelity function. Because G reflects the average codeword length saved due to semantic prediction, using G as the constraint is more consistent with shortening the codeword length, and G/R can represent the information efficiency.
The author uses the word "fidelity" because Shannon originally proposed the information rate–fidelity criterion [1] and later used minimum distortion to express maximum fidelity [38]. The author has previously called R(G) "the information rate of keeping precision" [6] or "information rate–verisimilitude" [34].
The R(G) function is defined as
R(G) = min_{P(y|x): I(X; Yθ) ≥ G} I(X; Y).
We use the Lagrange multiplier method to find the minimum MI and the optimized channel P(y|x). Using P(y|x) as a variation and letting the corresponding derivative of the Lagrangian function equal zero, we obtain
P(yj|xi) = P(yj) mij^s / ∑k P(yk) mik^s,
where mij = P(xi|θj)/P(xi) = T(θj|xi)/T(θj), and s is the Lagrange multiplier. Using P(y) as a variation and doing the same, we obtain
P+1(yj) = ∑i P(xi) P(yj|xi),
where P+1(yj) means the next P(yj). Because P(y|x) and P(y) are interdependent, we can first assume a P(y) and then repeat the above two formulas to obtain the convergent P(y) and P(y|x) (see [40] (p. 326)). We call this method the Minimum Information Difference (MID) iteration.
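The following numpy sketch (an assumed implementation, not the author's code) runs the MID iteration: it alternates the P(y|x) and P(y) updates above for a given source P(x), semantic channel, and parameter s; the example truth functions are illustrative.

```python
# MID iteration sketch: alternate the channel and destination updates for given P(x),
# truth functions T(theta_j|x), and parameter s; the example below is illustrative.
import numpy as np

def mid_iteration(P_x, T, s=1.0, iters=200):
    """P_x: shape (nx,); T: truth functions T(theta_j|x_i), shape (nx, ny)."""
    nx, ny = T.shape
    T_theta = P_x @ T                       # logical probabilities T(theta_j)
    m = T / T_theta                         # m_ij = T(theta_j|x_i)/T(theta_j)
    P_y = np.full(ny, 1.0 / ny)             # initial guess for P(y)
    for _ in range(iters):
        w = P_y * m ** s                    # numerator of the P(y|x) update
        P_y_given_x = w / w.sum(axis=1, keepdims=True)
        P_y = P_x @ P_y_given_x             # P+1(y_j) = sum_i P(x_i) P(y_j|x_i)
    R = np.sum(P_x[:, None] * P_y_given_x * np.log2(P_y_given_x / P_y))
    G = np.sum(P_x[:, None] * P_y_given_x * np.log2(m))
    return P_y_given_x, P_y, R, G

# Example with two fuzzy labels over a one-dimensional instance space
x = np.linspace(0, 10, 101)
P_x = np.full_like(x, 1.0 / x.size)
T = np.stack([np.exp(-0.5 * ((x - 3) / 1.5) ** 2),
              np.exp(-0.5 * ((x - 7) / 1.5) ** 2)], axis=1)
_, _, R, G = mid_iteration(P_x, T, s=1.0)
print(f"R = {R:.3f} bits, G = {G:.3f} bits")   # MID minimizes the information difference R - G
```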
The parameter solution of the R(G) function is illustrated in Figure 6.
Any R(G) function is bowl-shaped (possibly not symmetrical [6]), with a second derivative greater than zero. The slope s = dR/dG is positive on the right branch. When s = 1, G equals R, meaning the semantic channel matches the Shannon channel. G/R represents the information efficiency; its maximum is one. G has a maximum value, G+, and a minimum value, G−, for a given R. G− indicates how little semantic information the receiver can receive when the sender intentionally lies.
It is worth noting that, given a semantic channel T(y|x), matching the Shannon channel with the semantic channel, i.e., letting P(yj|x) ∝ T(yj|x) or P(x|yj) = P(x|θj), does not maximize the semantic MI, but minimizes the information difference between R and G or maximizes the information efficiency, G/R. Then, we can increase G and R simultaneously by increasing s. When s→∞ in Equation (23), P(yj|x) (j = 1, 2, …, n) only takes the value zero or one, becoming a classification function.
We can also replace the average distortion with the fuzzy entropy, H(Yθ|X) (using semantic distortion constraints), to obtain the information rate–truth function, R(Θ) [35]. In situations where information rather than truth is more important, R(G) is more appropriate than R(D) and R(Θ). The P(y) and P(y|x) obtained for R(Θ) are different from those obtained for R(G) because the optimization criteria are different. Under the minimum semantic distortion criterion, P(y|x) becomes a function of the constraint function T(θxi|y), so the distortion function is d(xi|y) = −log T(θxi|y), and R(Θ) becomes R(D). If T(θj) is small, the P(yj) required for R(G) will be larger than the P(yj) required for R(D) or R(Θ).
2.6. Semantic Channel Capacity
Shannon calls the maximum MI obtained by changing the source, P(x), for the given Shannon channel, P(y|x), the channel capacity. Because the semantic channel is also inseparable from the Shannon channel, we must provide both the semantic and the Shannon channels to calculate the semantic MI. Therefore, after the semantic channel is given, there are two cases: (1) the Shannon channel is fixed; (2) we must first optimize the Shannon channel using a specific criterion.
When the Shannon channel is fixed, the semantic MI is less than the Shannon MI, so the semantic channel capacity is less than or equal to the Shannon channel capacity. The difference between the two is shown in Equation (15).
If the Shannon channel is variable, we can use the MID iteration to find the Shannon channel for R = G after each change of the source, P(x), and then use s→∞ to find the Shannon channel, P(y|x), that makes R and G reach their maxima simultaneously. At this time, P(y|x) ∈ {0, 1} becomes the classification function. Then, we calculate the semantic MI. For different P(x), the maximum semantic MI is the semantic channel capacity. That is,
CT(Y|X) = max_{P(x)} Gmax,
where Gmax is G+ when s→∞ (see Figure 6). Hereafter, the semantic channel capacity only means CT(Y|X) in the above formula.
Practically, to find CT(Y|X), we can look for x(1), x(2), …, x(n) ∈ U, which are instances under the highest points of T(y1|x), T(y2|x), …, T(yn|x), respectively. Let P(x(j)) = 1/n, j = 1, 2, …, n, and let the probability of any other x equal zero. Then we can choose the Shannon channel P(yj|x(j)) = 1, j = 1, 2, …, n. At this time, I(X; Y) = H(Y) = log n, which is the upper limit of CT(Y|X). If, among the n instances x(j), there is an xi that makes more than one truth function true, then either T(yj) > P(yj) or the fuzzy entropy is not zero; CT(Y|X) will be slightly less than log n in this case.
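Below is a rough numerical sketch of this procedure (an interpretation with illustrative truth functions, not the paper's code): it picks one instance under the peak of each truth function, uses a uniform source over those instances and a deterministic channel, and compares the resulting semantic MI with the upper limit log n.

```python
# Approximate the semantic channel capacity for three Gaussian truth functions;
# the centers and widths are illustrative assumptions.
import numpy as np

x = np.linspace(0, 10, 101)
T = np.stack([np.exp(-0.5 * ((x - 2) / 1.0) ** 2),
              np.exp(-0.5 * ((x - 5) / 1.0) ** 2),
              np.exp(-0.5 * ((x - 8) / 1.0) ** 2)], axis=1)   # T(y_j|x)

n = T.shape[1]
peaks = T.argmax(axis=0)                  # x(j): instance under the highest point of T(y_j|x)
P_x = np.zeros(x.size)
P_x[peaks] = 1.0 / n                      # P(x(j)) = 1/n, other probabilities zero
T_theta = P_x @ T                         # logical probabilities under this source
G = np.mean(np.log2(T[peaks, np.arange(n)] / T_theta))   # semantic MI with P(y_j|x(j)) = 1
print(f"upper limit log n = {np.log2(n):.3f} bits, semantic MI = {G:.3f} bits")
```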
According to the above analysis, the encoding method to increase the capacity of the semantic channel is
Try to choose x that only makes one label’s truth value close to one (to avoid ambiguity and reduce the logical probability of y);
Encoding should make P(yj|xj) = 1 as much as possible (to ensure that Y is used correctly).
Choose P(x) so that each Y’s probability and logical probability are as equal as possible (close to 1/n, thereby maximizing the semantic entropy).
3. Electronic Semantic Communication Optimization
3.1. The Electronic Semantic Communication Model
The previous discussion of semantic communication did not consider conveying semantic information by electronic communication. Assuming that the time and space distance between the sender and the receiver is very large, we must transmit information through cables or disks. At this time, we need to add an electronic channel to the previous communication model, as shown in Figure 7.
There are two optimization tasks: optimizing the Shannon channel P(y|x) (Task 1) and optimizing the electronic channel from y to ŷ (Task 2).
With the optimized P(y|x) and P(ŷ|y), we can obtain P(x|ŷ) and restore x̂ according to P(x|ŷ). The x̂ will be quite different from x, but the basic feature information will be retained. The restoration quality depends on the feature extraction method.
Task 1 is a machine learning task. Assuming we have selected feature y, we can use the previous method to optimize the encoding with semantic information constraints. For example, the overall task is to transmit fruit image information. In Task 1, x is a high-dimensional feature vector of the fruit, which contains color and shape information; y is the low-dimensional feature vector of a fruit, which contains the fruit variety information. The steps are
- (1) Extract the feature vector, y, from x using machine learning methods (this is a task beyond G theory);
- (2) Obtain the sample distribution P(yj|x) (j = 1, 2, …) from the sample;
- (3) Obtain T(θj|x) ∝ P(yj|x) using LBI and further obtain I(x; yj) = log[T(θj|x)/T(θj)] and G;
- (4) Use the method of solving the R(G) function to obtain the R (to ensure that G or G/R is large enough) and the P(y|x) that reflects the encoding rule.
We also need to optimize step (1) according to the classification error between x̂ and x.
Task 2 is to optimize the electronic channel. Electronic semantic communication is still electronic communication, in essence. The difference is that we need to use semantic information loss instead of distortion as the optimization criterion.
3.2. Optimization of Electronic Semantic Communication with Semantic Information Loss as Distortion
Consider electronic semantic communication. If ŷj ≠ yj, there is semantic information loss. Farsad et al. call it a semantic error and propose the corresponding formula [57,58]. Papineni et al. also proposed a similar formula for translation [59]. Gündüz et al. [17] used a method to reduce the loss of semantic information by preserving data features. For more discussion, see [60]. G theory provides the following method for comparison.
According to G theory, the semantic information loss caused by using ŷj instead of yj is
LX(yj‖ŷj) = ∑i P(xi|yj) log[P(xi|θj)/P(xi|θ̂j)].
LX(yj‖ŷj) is a generalized KL divergence because there are three functions. It represents the average codeword length of the residual coding.
Since the loss is generally asymmetric, there may be LX(yj‖ŷj) ≠ LX(ŷj‖yj). For example, when "motor vehicle" and "car" are substituted for each other, the information loss is asymmetric. The reason is that there is a logical implication relationship between the two. Using "motor vehicle" to replace "car" reduces information but is not wrong, whereas using "car" to replace "motor vehicle" may be wrong, because the actual x may be a truck or a motorcycle. When an error occurs, the semantic information loss is enormous. An advantage of using the truth function to generate the distortion function is that it can reflect concepts' implication or similarity relationships.
Assuming yj is a correctly used label, it comes from sample learning, so P(x|θj) = P(x|yj), and LX(yj‖ŷj) = KL(P(x|θj)‖P(x|θ̂j)). The average semantic information loss is
L̄ = ∑j ∑k P(yj)P(ŷk|yj)LX(yj‖ŷk).
Consider using P(Y) as the source and P(Ŷ) as the destination to encode y. Let d(ŷk|yj) = LX(yj‖ŷk); we can obtain the information rate–distortion function R(D) for replacing Y with Ŷ. We can code Y for data compression according to the parameter solution of the R(D) function.
In the electronic communication part (from Y to Ŷ), other problems can be resolved by classical electronic communication methods, except for using semantic information loss as distortion.
If finding I(x; ŷj) is not too difficult, we can also use I(x; ŷj) as a fidelity function. Minimizing I(X; Ŷ) for a given G = I(X; Ŷθ), we can obtain the R(G) function between X and Ŷ and compress the data accordingly.
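The following small sketch assumes the generalized KL form of LX(yj‖ŷj) given above and uses illustrative distributions for the "motor vehicle"/"car" example; it shows that the loss is asymmetric.

```python
# Semantic information loss sketch; the three distributions are illustrative assumptions,
# with x ranging over {car, truck, motorcycle}.
import numpy as np

P_x_given_y      = np.array([0.70, 0.20, 0.10])   # sample distribution P(x|"motor vehicle")
P_x_given_theta  = np.array([0.70, 0.20, 0.10])   # semantic prediction of "motor vehicle"
P_x_given_theta2 = np.array([0.95, 0.04, 0.01])   # semantic prediction of "car"

def semantic_loss(P_sample, P_pred_correct, P_pred_used):
    """L_X(y_j || y^_k) = sum_i P(x_i|y_j) log[P(x_i|theta_j)/P(x_i|theta^_k)]."""
    return float(np.sum(P_sample * np.log2(P_pred_correct / P_pred_used)))

# Replacing "motor vehicle" with "car" may be wrong, so the loss is larger.
print("motor vehicle -> car:", semantic_loss(P_x_given_y, P_x_given_theta, P_x_given_theta2))
# Replacing "car" with "motor vehicle" only reduces information, so the loss is smaller.
print("car -> motor vehicle:", semantic_loss(P_x_given_theta2, P_x_given_theta2, P_x_given_theta))
```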
3.3. Experimental Results: Compress Image Data According to Visual Discrimination
The simplest visual discrimination is the discrimination of human eyes to different colors or gray levels. The next is the spatial discrimination of points. If the movement of a point on the screen is not perceived, the fuzzy movement range can represent spatial discrimination, which can be represented by a truth function (such as the Gaussian function). What is more complicated is to distinguish whether two figures are the same person. Advanced image compression methods, such as the Autoencoder, need to extract image features and use features to represent images. The following methods need to be combined with the feature extraction methods in deep learning to obtain better applications.
Next, we use the simplest gray-level discrimination as an example to illustrate digital image compression.
- (1) Measuring Color Information
A color can be represented by a vector (B, G, R). For convenience, we assume that the color is one-dimensional (or we only consider the gray level), expressed in x, and the color sense, Y, is the estimation of x, similar to the GPS indicator. The universes of x and y are the same, and yj = "x is about xj". If the color space is uniform, the truth (discrimination) function can be defined by distance, that is, T(yj|x) = exp[−(x − xj)²/(2σ²)]. Then the average information of color perception is I(X; Yθ) = H(Yθ) − d̄.
Given the source P(x) and the discrimination function T(y|x), we can solve P(y|x) and P(y) using the Semantic Variational Bayes (SVB) method. The Shannon channel is matched with the semantic channel to maximize the information efficiency.
- (2) Gray Level Compression
We used an example to demonstrate color data compression. It was assumed that the original gray level was 256 (8-bit pixels) and needed to be compressed into 8 (3-bit pixels). We defined eight constraint functions, as shown in Figure 8a.
Considering that human visual discrimination varies with the gray level (the higher the gray level, the lower the discrimination), we used the eight truth functions shown in Figure 8a, representing eight fuzzy ranges. Appendix C in Reference [37] shows how these curves are generated. The task was to use the maximum information efficiency (MIE) criterion to find the Shannon channel P(y|x) that made R close to G (s = 1).
The convergent P(y|x) is shown in Figure 8b. Figure 8c shows that the Shannon MI and the semantic MI gradually approach in the iteration process. Comparing Figure 8a,b, we find it easy to control P(y|x) by T(y|x). However, defining the distortion function without using the truth function is difficult. Predicting the convergent P(y|x) by d(y|x) is also difficult.
If we use s to strengthen the constraint, we obtain the parametric solution of the R(G) function. As s→∞, P(yj|x) (j = 1, 2, …) display as rectangles and become classification functions.
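The sketch below (an illustrative construction, not the code of Appendix C in [37]) builds eight gray-level truth functions whose widths grow with the gray level, mimicking lower discrimination at higher gray levels; the resulting semantic channel can then be fed to the MID iteration of Section 2.5 to obtain P(y|x).

```python
# Illustrative gray-level truth functions: eight fuzzy ranges over 256 gray levels,
# with widths that increase with the gray level (assumed parameters).
import numpy as np

x = np.arange(256, dtype=float)             # original gray levels (8-bit)
centers = np.linspace(16, 240, 8)           # eight representative levels (3-bit output)
sigmas = 4.0 + 0.1 * centers                # discrimination worsens as the gray level rises

T = np.stack([np.exp(-0.5 * ((x - c) / s) ** 2) for c, s in zip(centers, sigmas)], axis=1)
P_x = np.full(256, 1.0 / 256)               # assumed source
T_theta = P_x @ T                           # logical probabilities T(theta_j)
print("T(theta_j):", np.round(T_theta, 3))
```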
- (3) The Influence of Visual Discrimination and the Quantization Level on the R(G) Function
The author performed some experiments to examine the influence of the discrimination function and the quantization level b = 2^n (n is the number of quantization bits) on the R(G) function. Figure 9 shows that when the quantization level was high enough, the variation range of R and G increased with increasing discrimination (i.e., with decreasing σ). The result explains how the discrimination determines the semantic channel capacity. Figure 9a indicates that higher discrimination can convey more semantic information for a given quantization level (b = 63). Figure 9b shows that for a given discrimination (σ = 1/64), a smaller b wastes the semantic channel capacity. For more discussion on visual information, see Section 6 in [6].
5. G Theory’s Applications in Machine Learning
5.1. Basic Methods of Machine Learning: Learning Functions and Optimization Criteria
The most basic machine learning method has two steps:
First, we use samples or sample distributions to train the learning functions with a specific criterion, such as the maximum likelihood or RLS criterion;
Then, we make probability predictions or classifications utilizing the learning function with the minimum distortion, minimum loss, or maximum likelihood criterion.
We use the maximum likelihood or RLS criterion when training learning functions. We may use different criteria when using learning functions for classification. We generally use the maximum likelihood and RLS criteria for prediction tasks, where information is important. To judge whether a person is guilty, where correctness is essential, we may use the minimum distortion (or loss) criterion. The maximum semantic information criterion is equivalent to the maximum likelihood criterion and is a Regularized Least Distortion (RLD) criterion, including the partition function's logarithm. Compared with the minimum distortion criterion, the maximum semantic information criterion can reduce the under-reporting of small-probability events.
We generally do not use P(x|yj) to train P(x|θj) because if P(x) changes, the originally trained P(x|θj) will become invalid. A parameterized transition probability function, P(θj|x), as a learning function, is still valid even if P(x) is changed. However, using P(θj|x) as a learning function also has essential defects. When class number n > 2, it is challenging to construct P(θj|x) (j = 1, 2, …) because of the normalization restriction, that is, ∑j P(θj|x) = 1 (for each x). As we will see below, there is no restriction when using truth or membership functions as learning functions.
The following sections introduce G theory's applications to machine learning. With existing methods, none of these tasks is easy: either the solution is complicated, or the convergence proof is difficult.
5.2. For Multi-Label Learning and Classification
Consider multi-label learning, a supervised learning task. From the sample {(xk, yk), k = 1, 2, …, N}, we can obtain the sample distribution, P(x, y). Then, we use Formula (16) or (18) for the optimized truth functions.
Assume that a truth function is a Gaussian function; there should be
T(θj|x) = exp[−(x − μj)²/(2σj²)].
So, we can use the expectation and standard deviation of P(x|yj)/P(x) or P(yj|x) as the expectation and the standard deviation of T(θj|x). If the truth function is like a dam cross-section (see Figure 14), we can obtain it through some transformation.
If we only know P(yj|x) but not P(x), we can assume that P(x) is equally probable, that is, P(x) = 1/|U|, and then optimize the membership function using the following formula:
T*(θj|x) = P(yj|x)/max[P(yj|x)].
For multi-label classification, we can use the classifier
y* = f(x) = arg max_j I(x; θj).
If the distortion criterion is used, we can use −log T(θj|x) as the distortion function or replace I(X; θj) with T(θj|x).
The popular Binary Relevance [70] for multi-label classification converts an n-label learning task into an n-pair label learning task. In comparison, the above method is much simpler.
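A minimal sketch of this multi-label method is given below, assuming Gaussian truth functions and an illustrative sample channel P(yj|x): each T(θj|x) is fitted from the weights P(yj|x), and classification selects the label with maximum semantic information.

```python
# Multi-label learning sketch with Gaussian truth functions; the sample channel
# P(y_j|x) and the label shapes are illustrative assumptions.
import numpy as np

x = np.arange(0, 101, dtype=float)                    # e.g., ages 0..100
P_x = np.full(x.size, 1.0 / x.size)                   # assumed equally probable P(x)
P_y_given_x = np.stack([np.exp(-0.5 * ((x - 25) / 10) ** 2),   # label y1, e.g., "youth"
                        np.exp(-0.5 * ((x - 70) / 12) ** 2)],  # label y2, e.g., "elderly"
                       axis=1)

def fit_gaussian_truth(x, weights):
    """Use the expectation and standard deviation under P(y_j|x) for T(theta_j|x)."""
    w = weights / weights.sum()
    mu = np.sum(w * x)
    sigma = np.sqrt(np.sum(w * (x - mu) ** 2))
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

T = np.stack([fit_gaussian_truth(x, P_y_given_x[:, j]) for j in range(2)], axis=1)
T_theta = P_x @ T                                     # logical probabilities T(theta_j)
I = np.log2(T / T_theta)                              # semantic information I(x; theta_j)
print("label with maximum I(x; theta_j) at age 65:", int(np.argmax(I[65])))
```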
5.3. Maximum MI Classification of Unseen Instances
This classification belongs to semi-supervised learning. We take the medical test and the signal detection as examples (see Figure 15).
The following algorithm is not limited to binary classifications. Let Cj be a subset of C and yj = f(z|z ∈ Cj); hence, S = {C1, C2, …} is a partition of C. Our task is to find the optimal partition, S*, that maximizes the semantic MI.
First, we initiate a partition. Then, we do the following iterations.
Matching I: Let the semantic channel match the Shannon channel and set the reward function. First, for a given S, we obtain the Shannon channel:
P(yj|x) = ∑_{z∈Cj} P(z|x), j = 1, 2, ….
Then we obtain the semantic channel, T(y|x), from the Shannon channel and T(θj) (or mθ(x, y) = m(x, y)), and hence I(xi; θj). For a given z, we have the conditional information as the reward function:
I(X; θj|z) = ∑i P(xi|z) I(xi; θj).
Matching II: Let the Shannon channel match the semantic channel by the classifier
y = f(z) = arg max_{yj} I(X; θj|z).
Repeat Matching I and Matching II until S does not change. Then, the convergent S is the S* we seek. The author explained the convergence with the R(G) function (see Section 3.3 in [7]).
Figure 16 shows an example; the detailed data can be found in Section 4.2 of [7]. The two lines in Figure 16a represent the initial partition. Figure 16d shows that the convergence is very fast.
However, this method is unsuitable for the maximum MI classification in a high-dimensional space. We need to combine neural network methods to explore more effective approaches.
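The following compact sketch is one interpretation of the Matching I/Matching II iteration for a binary, medical-test-like example; the Gaussian likelihoods P(z|x), the prior, and the initial partition are illustrative assumptions.

```python
# Maximum-MI classification sketch: alternate Matching I (channels and rewards from the
# current partition) and Matching II (reclassify by the conditional information).
import numpy as np

z = np.linspace(-5, 15, 401)                       # observed test score
P_x = np.array([0.8, 0.2])                         # prior: uninfected, infected (assumed)
P_z_given_x = np.stack([np.exp(-0.5 * (z - 2) ** 2),
                        np.exp(-0.5 * ((z - 6) / 1.5) ** 2)], axis=1)
P_z_given_x /= P_z_given_x.sum(axis=0, keepdims=True)
P_zx = P_z_given_x * P_x                           # P(z, x)
P_x_given_z = P_zx / P_zx.sum(axis=1, keepdims=True)

labels = (z > 3.0).astype(int)                     # initial partition S = {C0, C1}
for _ in range(20):
    # Matching I: Shannon channel P(y_j|x) from the partition, then the reward I(x_i; theta_j)
    P_y_given_x = np.stack([P_z_given_x[labels == j].sum(axis=0) for j in (0, 1)], axis=0)
    P_y = P_y_given_x @ P_x
    I_x_theta = np.log2(P_y_given_x / P_y[:, None])        # with T(theta_j|x) propto P(y_j|x)
    # Matching II: classify each z by sum_i P(x_i|z) I(x_i; theta_j)
    reward = P_x_given_z @ I_x_theta.T                     # shape (n_z, 2)
    labels = reward.argmax(axis=1)

print("decision boundary near z =", float(z[np.argmax(labels)]))
```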
5.4. The Explanation and Improvement of the EM Algorithm for Mixture Models
The EM algorithm [71,72] is usually used for mixture models (clustering); it is an unsupervised learning method.
We know that P(x) = ∑j P(yj)P(x|yj). Given a sample distribution, P(x), we use Pθ(x) = ∑j P(yj)P(x|θj) to approximate P(x) so that the relative entropy or KL divergence, KL(P‖Pθ), is close to zero. P(y) is the probability distribution of the latent variable to be sought.
The EM algorithm first presets P(x|θj) and P(yj), j = 1, 2, …, n. The E-step obtains
P(yj|xi) = P(yj)P(xi|θj) / ∑k P(yk)P(xi|θk).
Then, in the M-step, the log-likelihood of the complete data (usually represented by Q) is maximized. The M-step can be divided into two steps: the M1-step, which obtains
P+1(yj) = ∑i P(xi)P(yj|xi),
and the M2-step, which optimizes the likelihood functions. For Gaussian mixture models, we can use the expectation and standard deviation of P(x)P(yj|x)/P+1(yj) as the expectation and standard deviation of P(x|θj+1).
From the perspective of G theory, the M2-step is to make the semantic channel match the Shannon channel or minimize the VFE, H(X|Yθ), and the E-step and M1-step are to match the Shannon channel with the semantic channel. Repeating the above three steps can make the mixture model converge. The converged P(y) is the required probability distribution of the latent variable. According to the derivation process of the R(G) function, the E-step and M1-step minimize the information difference, R–G; the M-step maximizes the semantic MI. Therefore, the optimization criterion used by the EM algorithm is the MIE criterion.
However, there are two problems with the above method to find the latent variable: (1) P(y) may converge slowly; (2) if the likelihood functions are also fixed, how do we solve P(y)?
Based on the R(G) function analysis, the author improved the EM algorithm to the EnM algorithm [7]. The EnM algorithm includes the E-step for P(y|x), the n-step for P(y), and the M-step for P(x|θj) (j = 1, 2, …). The n-step repeats the E-step and the M1-step of the EM algorithm n times so that P+1(y) ≈ P(y). The EnM algorithm also uses the MIE criterion. The n-step can speed up the solution of P(y). The M2-step only optimizes the likelihood functions. Because P(yj)/P+1(yj) is approximately equal to one, we can use the following formula to optimize the model parameters:
P(x|θj+1) = P(x)P(yj|x)/P(yj) = P(x)P(x|θj)/Pθ(x).
Without the n-step, there would be P(yj) ≠ P+1(yj) and ∑i P(xi)P(xi|θj)/Pθ(xi) ≠ 1. When solving mixture models, we can choose a smaller n, such as n = 3. When solving P(y) specifically, we can select a larger n until P(y) converges. When n = 1, the EnM algorithm becomes the EM algorithm.
The convergence of the EnM algorithm can be proved mathematically. After the M-step and the E-step, we can deduce Equation (81), in which KL(P‖Pθ) is the relative entropy or KL divergence between P(x) and Pθ(x), and the right KL divergence, KL(PY+1‖PY), is close to zero after the n-step.
Equation (81) can be used to prove that the EnM algorithm converges. Because the M-step maximizes G, and the E-step and the n-step minimize R–G and KL(PY+1‖PY), KL(P‖Pθ) can be close to zero. We can also use the above method to prove that the EM algorithm converges.
In most cases, the EnM algorithm performs better than the EM algorithm, especially when P(y) is hard to converge.
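The following schematic numpy sketch (an assumed implementation, not the author's code) illustrates the EnM idea: an E-step and M1-step repeated n times (the n-step), followed by an M-step that refits the Gaussian components; the data use true proportions 0.3:0.7, as in the modified Neal–Hinton example, with assumed means and standard deviations.

```python
# EnM sketch for a two-component Gaussian mixture on a discretized axis.
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(40, 10, 300), rng.normal(75, 8, 700)])
edges = np.linspace(0, 120, 241)
x = 0.5 * (edges[:-1] + edges[1:])                    # bin centers
P_x = np.histogram(data, bins=edges)[0].astype(float)
P_x /= P_x.sum()                                      # sample distribution P(x)

mu, sigma = np.array([30.0, 90.0]), np.array([15.0, 15.0])
P_y = np.array([0.5, 0.5])                            # latent variable distribution P(y)

def components(mu, sigma):
    g = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
    return g / g.sum(axis=0, keepdims=True)           # discrete P(x|theta_j)

for _ in range(50):
    P_x_given_theta = components(mu, sigma)
    for _ in range(3):                                # n-step (n = 3): repeat E- and M1-steps
        P_xy = P_x_given_theta * P_y                  # E-step: P(y_j|x) propto P(y_j)P(x|theta_j)
        P_y_given_x = P_xy / P_xy.sum(axis=1, keepdims=True)
        P_y = P_x @ P_y_given_x                       # M1-step: P+1(y_j)
    w = P_x[:, None] * P_y_given_x                    # M-step: refit components from P(x)P(y_j|x)
    w /= w.sum(axis=0, keepdims=True)
    mu = (w * x[:, None]).sum(axis=0)
    sigma = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0))

print("P(y):", np.round(P_y, 3), " means:", np.round(mu, 1), " stds:", np.round(sigma, 1))
```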
Some researchers believe that EM makes the mixture model converge because the complete data log-likelihood Q = −H(X, Yθ) continues to increase [72], or because the negative free energy F′ = H(Y) + Q continues to increase [21]. However, we can easily find counterexamples where R–G continues to decrease, but Q and F′ do not necessarily continue to increase.
The author used the example used by Neal and Hinton [21] (see Figure 17), but the mixture proportion in the true model was changed from 0.7:0.3 to 0.3:0.7.
This experiment shows that the decrease in R–G, not the increase in Q or F′, is the reason for the convergence of the mixture model.
The free energy of the true mixture model (with true parameters) is the Shannon conditional entropy H(X|Y). If the standard deviation of the true mixture components is large, H(X|Y) is also large. If the initial standard deviation is small, F is small initially. After the mixture model converges, F must approach H(X|Y). Therefore, F increases (i.e., F′ decreases) during the convergence process. For example, consider a true mixture model whose two components both have a standard deviation of 15; if the two initial standard deviations are 5, F must keep increasing during the iteration. Many experiments have shown that this is indeed the case.
Equation (81) can also explain pre-training in deep learning, where we need to maximize the model’s predictive ability and minimize the information difference, R–G (or compress data).
5.5. Semantic Variational Bayes: A Simple Method for Solving Hidden Variables
Given P(x) and the constraints P(x|θj), j = 1, 2, …, we need to solve the P(y) that produces P(x) = ∑j P(yj)P(x|θj). P(y) is the probability distribution of the latent variable y and is sometimes itself called the latent variable. The popular method is the Variational Bayes method (VB for short) [27,28,29]. This method originated from the article by Hinton and van Camp [20]. It was further discussed and applied in the articles by Neal and Hinton [21], Beal [28], and Koller [29] (ch. 11). Gottwald and Braun's article, "The Two Kinds of Free Energy and the Bayesian Revolution" [73], discusses the relationship between the MFE principle and the ME principle in detail.
VB uses P(y) (usually written as g(y)) as a variation to minimize the variational free energy, F, which can also be expressed with the semantic information method.
Since F is equal to the semantic posterior entropy H(X|Yθ) of X after the M1-step of the EM algorithm, we can treat F as H(X|Yθ). Since I(X; Yθ) = H(X) − H(X|Yθ) = F − H(X|Y), the smaller the F, the larger the semantic MI.
It is easy to prove that when the semantic channel matches the Shannon channel, that is, T(θj|x) ∝ P(yj|x) or P(x|θj) = P(x|yj) (j = 1, 2, …), F is minimized, and the semantic MI is maximized. Minimizing F can optimize the prediction model, P(x|θj) (j = 1, 2, …), but it cannot optimize P(y). For optimizing P(y), the mean-field approximation [27,28] is used; that is, P(y|x) instead of P(y) is used as the variation. Only one P(yj|x) is optimized at a time, and the other P(yk|x) (k ≠ j) remain unchanged. Minimizing F in this way is actually maximizing the log-likelihood of x or minimizing KL(P‖Pθ). In this way, optimizing P(y|x) also indirectly optimizes P(y).
Unfortunately, when optimizing P(y) and P(y|x), F may not continue to decrease (see Figure 17). So, VB is suitable as a tool but imperfect as a theory.
Fortunately, it is easier to solve P(y|x) and P(y) using the MID iteration used for solving the R(D) and R(G) functions. The MID iteration plus LBI for optimizing the prediction model is SVB [61]. It uses the MIE criterion.
When the constraint changes from likelihood functions to truth functions or similarity functions, P(yj|xi) in the MID iteration formula is computed from the truth or similarity functions instead of the likelihood functions.
From P(x) and the new P(y|x), we can obtain the new P(y). Repeating the formulas for P(y|x) and P(y) will lead to the convergence of P(y). Using s allows us to tighten the constraints for increasing R and G. Choosing proper s enables us to balance between maximizing semantic information and maximizing information efficiency.
The main tasks of SVB and VB are the same: using variational methods to solve latent variables according to observed data and constraints. The differences are
Criteria: In the definition of VB, it adopts the MFE criterion, whereas, for solving P(y), it uses P(y|x) as the variation and hence actually uses the maximum likelihood criterion that makes the mixture model converge. In contrast, SVB uses the MID criterion.
Variational method: VB only uses P(y) or P(y|x) as a variation, while SVB alternatively uses P(y|x) and P(y) as variations.
Computational complexity: VB uses logarithmic and exponential functions to solve P(y|x) [27]; the calculation of P(y|x) in SVB is relatively simple (for the same task, i.e., when s = 1).
Constraints: VB only uses likelihood functions as constraint functions. In contrast, SVB allows using various learning functions (including likelihood, truth, membership, similarity, and distortion functions) as constraints. In addition, SVB can use the parameter s to enhance constraints.
Because SVB is more compatible with the maximum likelihood criterion and the ME principle, it should be more suitable for many applications in machine learning. However, because it does not consider the probability of parameters, it may not be as applicable as VB on some occasions.
5.6. Bayesian Confirmation and Causal Confirmation
Logical empiricism was opposed by Popper's falsificationism [44,45], so it turned to confirmation (i.e., Bayesian confirmation) instead of induction or positivism [74,75]. Bayesian confirmation was previously a field of concern for researchers in the philosophy of science [76,77], and now many researchers in the natural sciences have also begun to study it [78,79]. The reason is that uncertain reasoning requires major premises, which need to be confirmed.
The main reasons why researchers have different views on Bayesian confirmation are
There are no suitable mathematical tools; for example, statistical and logical probabilities are not well distinguished.
Many people do not distinguish between the confirmation of the relationship (i.e., →) in the major premise y→x and the confirmation of the consequent (i.e., x occurs);
No confirmation measure can reasonably clarify the raven paradox [74].
To clarify the raven paradox, the author wrote the article "Channels' confirmation and predictions' confirmation: from medical tests to the Raven paradox" [35].
In the author's opinion, the task of Bayesian confirmation is to evaluate the support of the sample distribution for the major premise. For example, for the medical test (see Figure 15), a major premise is "If a person tests positive (y1), then he is infected (x1)", abbreviated as y1→x1. For a channel's confirmation, a truth (or membership) function can be regarded as a combination of a clear truth function, T(y1|x) ∈ {0, 1}, and a tautology's truth function (which is always one). The tautology's proportion, b1′, is the degree of disbelief. The credibility is b1, and its relationship with b1′ is b1′ = 1 − |b1|. See Figure 18.
We change b1 to maximize the semantic KL information, I(X; θ1); the optimized b1, denoted as b1*, is the confirmation degree. It is a function of the positive likelihood ratio R+ = P(y1|x1)/P(y1|x0), which indicates the reliability of a positive test. This conclusion is compatible with medical test theory.
Considering the prediction confirmation degree, we assume that P(x|θ1) is a combination of a 0–1 part and an equally probable part. The proportion of the 0–1 part is the prediction credibility, and the optimized credibility is the prediction confirmation degree, c1*, which depends on a, the number of positive examples, and c, the number of negative examples.
Both confirmation degrees can be used for probability predictions, i.e., calculating P(x|θ1).
Hempel proposed the confirmation paradox, namely the raven paradox [74]. According to the equivalence condition in classical logic, "If x is a raven, then x is black" (Rule 1) is equivalent to "If x is not black, then x is not a raven" (Rule 2). According to this, a piece of white chalk supports Rule 2 and therefore also supports Rule 1. However, according to common sense, a black raven supports Rule 1, a non-black raven opposes Rule 1, and something that is not a raven, such as a black cat or a piece of white chalk, is irrelevant to Rule 1. Therefore, there is a paradox between the equivalence condition and common sense. Using the confirmation measure c1*, we can ensure that common sense is correct and that the equivalence condition for fuzzy major premises is wrong, thus eliminating the raven paradox. However, other confirmation measures cannot eliminate the raven paradox [35].
Causal probability is used in causal inference theory [80]. It indicates the necessity of the cause, x1, replacing x0 to lead to the result, y1, where P(y1|x) = P(y1|do(x)) is the posterior probability of y1 caused by the intervention x. The author uses the semantic information method to obtain the channel causal confirmation degree [36]. It is compatible with the above causal probability but can also express negative causal relationships, such as the necessity of vaccines inhibiting infection, because it can be negative.
5.7. Emerging and Potential Applications
- (1)
About self-supervised learning.
Applications of estimating MI have emerged in the field of self-supervised learning; the estimating MI is a special case of semantic MI. Both MINE, proposed by Belghazi et al. [23], and InfoNCE, proposed by Oord et al. [24], use the estimating MI. MINE and InfoNCE are essentially the same as the semantic information methods. Their common features are as follows:
- The truth function, T(θj|x), or similarity function, S(x, yj), proportional to P(yj|x), is used as the learning function. Its maximum value is generally one, and its average is the partition function, Zj.
- The estimating information or semantic information between x and yj is log[T(θj|x)/Zj] or log[S(x, yj)/Zj].
- The statistical probability distribution, P(x, y), is still used when calculating the average information.
However, many researchers are still unclear about the relationship between the estimating MI and the Shannon MI. G theory’s R(G) function can help readers understand this relationship (see the numerical sketch at the end of this subsection).
- (2)
About reinforcement learning.
The goal-oriented information introduced in Section 4.2 can be used as a reward for reinforcement learning. Assuming that the probability distribution of x in state sk is P(x|ak−1), which becomes P(x|ak) in state sk+1, the reward of ak is the corresponding goal-oriented information.
Reinforcement learning seeks the optimal action sequence a1, a2, …, that maximizes the sum of rewards r1 + r2 + …. Like constraint control, reinforcement learning also needs a trade-off between maximum purposefulness and minimum control cost; the R(G) function should be helpful here.
- (3)
About the truth function and fuzzy logic for neural networks.
When we use the truth, distortion, or similarity function as the weight parameter of the neural network, the neural network contains semantic channels. Then, we can use semantic information methods to optimize the neural network. Using the truth function, T(θj|x), as the weight is better than using the parameterized inverse probability function, P(θj|x), because there is no normalization restriction when using truth functions.
However, unlike the clustering of points on a plane, image clustering treats each image as a data point, and the similarity function between images requires different methods. A common method is to regard an image as a vector and use the cosine similarity between vectors. However, the cosine similarity may take negative values, which requires activation functions and biases to make the necessary conversions. Combining existing neural network methods with channel-matching algorithms needs further exploration.
Fuzzy logic, especially fuzzy logic compatible with Boolean algebra [52], seems to be useful in neural networks. For example, the activation function Relu(a − b) = max(0, a − b), which is commonly used in neural networks, is the logical difference operation, f(a) = max(0, a − b), used in the author’s color vision mechanism model. Truth functions, fuzzy logic, and the semantic information method used in neural networks should make neural networks easier to understand.
- (4)
Explaining data compression in deep learning.
To explain the success of deep neural networks, such as AutoEncoders [33] and Deep Belief Networks [81], Tishby et al. [31] proposed the information bottleneck explanation, affirming that when optimizing deep neural networks, we maximize the Shannon MI between some layers and minimize the Shannon MI between other layers. However, from the perspective of the R(G) function, each coding layer of the AutoEncoder needs to maximize the semantic MI and minimize the Shannon MI; pre-training lets the semantic channel match the Shannon channel so that G ≈ R and KL(P‖Pθ) ≈ 0 (as when mixture models converge). Fine-tuning increases R and G at the same time by increasing s (making the partition boundaries steeper).
Not long ago, researchers at OpenAI [82,83] explained general artificial intelligence in terms of lossless (actually, loss-limited) data compression, which is similar to the explanation based on the MIE criterion.
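The numerical sketch below (referred to in item (1) above) illustrates, with assumed toy distributions, the relationship between the estimating MI and the Shannon MI: when the learning function S(x, yj) is proportional to P(yj|x) with a maximum value of one, the estimating MI, computed as the average of log[S(x, yj)/Zj] with Zj taken as the average of S over P(x), equals the Shannon MI; with a mismatched learning function, it is smaller. The distributions and names are illustrative assumptions, not code from MINE or InfoNCE.

```python
import numpy as np

# A toy numerical sketch of the estimating MI described in item (1) above.
# Assumptions: discrete x and y with assumed P(x) and P(y|x); the learning
# function S(x, y_j) has maximum value 1; the partition function is
# Z_j = sum_x P(x) S(x, y_j); the estimating MI is E[log2 S(x, y_j)/Z_j].

P_x = np.array([0.4, 0.3, 0.3])
P_y_given_x = np.array([[0.9, 0.1],
                        [0.5, 0.5],
                        [0.2, 0.8]])
P_xy = P_x[:, None] * P_y_given_x
P_y = P_xy.sum(axis=0)

def estimating_mi(S):
    """Average of log2[S(x, y_j)/Z_j] over P(x, y), with Z_j = sum_x P(x) S(x, y_j)."""
    Z = P_x @ S
    return float(np.sum(P_xy * np.log2(S / Z)))

shannon_mi = float(np.sum(P_xy * np.log2(P_xy / (P_x[:, None] * P_y))))

S_matched = P_y_given_x / P_y_given_x.max(axis=0)  # S proportional to P(y|x), max = 1
S_mismatched = np.array([[1.0, 0.3],               # an arbitrary, imperfect learning function
                         [0.7, 0.6],
                         [0.4, 1.0]])

print(f"Shannon MI             : {shannon_mi:.4f} bits")
print(f"Estimating MI, matched : {estimating_mi(S_matched):.4f} bits")    # equals Shannon MI
print(f"Estimating MI, other   : {estimating_mi(S_mismatched):.4f} bits") # smaller
```

This matches the picture given by the R(G) function: the estimating (semantic) MI reaches the Shannon MI only when the semantic channel matches the Shannon channel.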
6. Discussion and Summary
6.1. Core Idea and Key Methods of Generalizing Shannon’s Information Theory
In order to overcome the three shortcomings of Shannon’s information theory (it is not suitable for semantic communication, lacks the information criterion, and cannot bring model parameters or likelihood functions into the MI formula), the core idea for the generalization is to replace distortion constraints with semantic constraints. Semantic constraints include semantic distortion constraints (using log[1/T(θj|x)] as the distortion function), semantic information quantity constraints, and semantic information loss constraints (for electronic semantic communication). In this way, the shortcomings can be overcome, and existing coding methods can be used.
One key method is using the P-T probability framework with the truth function. The truth function can represent semantics, form a semantic channel, and link likelihood and distortion functions so that sample distributions can be used to optimize learning functions (likelihood function, truth function, similarity function, etc.).
The second key method is to define semantic information as the negative regularized distortion. In this way, the semantic MI, I(X; Yθ), equals the semantic entropy, Hθ(Y), minus the average distortion. The MI between temperature and molecular energy also has this form in a thermodynamic local equilibrium system. The logarithm of the softmax function, which is widely used in machine learning, also has this form [37]. The characteristic of this regularization is that the regularization term has the form of semantic entropy, which contains the logarithm of the logical probability, T(θj), or the partition function, Zj. We call it “partition regularization” to distinguish it from other forms of regularization. Why can this semantic information measure also measure the information in local equilibrium systems? The reason is that the semantic constraint is similar to the energy constraint; both are fuzzy range constraints and can be represented by negative exponential functions.
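These identities can be checked numerically. The sketch below, with assumed toy distributions and truth functions, verifies that the semantic MI equals the semantic entropy minus the average semantic distortion, log[1/T(θj|x)], and also equals H(X) minus the VFE (posterior cross-entropy) when the prediction is taken as P(x|θj) = P(x)T(θj|x)/T(θj). All numbers and variable names are illustrative assumptions.

```python
import numpy as np

# A minimal numerical check, under assumed toy distributions, of two identities:
# (i) semantic MI = semantic entropy - average semantic distortion, with
#     distortion d(x, y_j) = log2[1/T(theta_j|x)];
# (ii) semantic MI = H(X) - VFE, with the prediction P(x|theta_j) = P(x)T(theta_j|x)/T(theta_j).
# The toy P(x), P(y|x), and truth functions are assumptions, not values from the paper.

P_x = np.array([0.5, 0.3, 0.2])                  # source distribution P(x)
P_y_given_x = np.array([[0.8, 0.2],              # Shannon channel P(y|x)
                        [0.4, 0.6],
                        [0.1, 0.9]])
P_xy = P_x[:, None] * P_y_given_x                # joint P(x, y)
P_y = P_xy.sum(axis=0)

T = np.array([[1.0, 0.2],                        # truth functions T(theta_j|x), max = 1
              [0.5, 0.7],
              [0.1, 1.0]])
T_logical = P_x @ T                              # logical probabilities T(theta_j)

# Semantic mutual information I(X; Y_theta) = E[log2 T(theta_j|x)/T(theta_j)]
I_semantic = np.sum(P_xy * np.log2(T / T_logical))

# (i) Semantic entropy minus average semantic distortion
H_semantic_Y = -np.sum(P_y * np.log2(T_logical))
avg_distortion = np.sum(P_xy * np.log2(1.0 / T))
print(I_semantic, H_semantic_Y - avg_distortion)          # equal

# (ii) H(X) minus VFE, with the semantic Bayes prediction P(x|theta_j)
P_x_given_theta = (P_x[:, None] * T) / T_logical
H_X = -np.sum(P_x * np.log2(P_x))
VFE = -np.sum(P_xy * np.log2(P_x_given_theta))            # posterior cross-entropy H(X|Y_theta)
print(I_semantic, H_X - VFE)                              # equal
```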
Because the MFE criterion is equivalent to the Partition-Regularized Least Distortion (PRLD) criterion, the successful applications of VB and the MFE principle also support the PRLD criterion.
Because T(θj) represents the average truth value, log[T(θj|x)/T(θj)] represents the progress from not-so-true to true. Comparing this with the core part of the Shannon MI, log[P(x|yj)/P(x)], we can say that Shannon information comes from the improvement of probability, whereas semantic information comes from the improvement of truth. Furthermore, because log[T(θj|x)/T(θj)] = log[P(x|θj)/P(x)], the PRLD criterion is equivalent to the maximum likelihood criterion. Machine learning researchers generally believe that regularization reduces overfitting. Now, we find that, more importantly, partition regularization is compatible with the maximum likelihood criterion.
Because of the above two key methods, the three shortcomings of Shannon information theory no longer exist in G theory. Moreover, the success of VB and the MFE principle prompted Shannon information theory researchers to use the minimum VFE criterion or the maximum semantic information criterion to minimize the residual coding length of predictive coding.
Semantic constraints include semantic information loss constraints for electronic communication (see
Section 3.2). Using this constraint, we can use classic coding methods to achieve electronic semantic communication.
6.2. Some Views That Are Different from Those G Theory Holds
View 1: We can measure the semantic information of language itself.
Some people want to measure the semantic information provided by a sentence (such as Carnap and Bar-Hillel), and others measure the semantic information between two sentences, regardless of the facts. In the author’s opinion, these practices ignore the source of information: the real world. G theory follows Popper’s idea and affirms that information comes from factual testing; if a hypothesis conforms to the facts and has a small prior logical probability, there is more information; if it is wrong, there is less or negative information. Some people may say that the translation between two sentences transmits semantic information. In this regard, we must distinguish between actual translation and the formulation of translation rules. Actual translation does provide semantic information, while translation rules do not provide semantic information, but they determine semantic information loss. Therefore, the actual translation should use the maximum semantic information criterion, while optimizing translation rules should use the minimum semantic information loss criterion. Optimizing electronic semantic communication is similar to optimizing translation rules, and the minimum semantic information loss criterion should be used.
View 2: Semantic communication needs to transmit semantics.
Because the transmission of Shannon information requires a physical channel, some people believe that the transmission of semantic information also requires a corresponding physical channel or utilizes the physical channel to transmit semantics. In fact, generally speaking, the semantic channel already exists in the brain or knowledge of the sender and the receiver, and there is no need to consider the physical channel. Only when the knowledge of both parties is inconsistent do we need to consider such a physical channel. The picture–text dictionary is a good physical channel for transmitting semantics. However, the receiver only needs to look at it once, and it will not be needed in the future.
View 3: Semantic information also needs to be minimized.
In Shannon’s information theory, the MI, R, should be minimized when the distortion limit, D, is given to improve communication efficiency. The information rate–distortion function, R(D), provides the theoretical basis for data compression. Therefore, some people imitated the information rate–distortion function and proposed the semantic information rate–distortion function, Rs(D), where Rs is the minimum semantic MI. However, from the perspective of G theory, although we consider semantic communication, the communication cost is still Shannon’s MI, which needs to be minimized. Therefore, G theory replaces the average distortion with the semantic MI to evaluate communication quality and uses the R(G) function.
View 4: Semantic information measures do not require encoding meaning.
Some people believe that the concept of semantic information has nothing to do with encoding and that constructing a semantic information measure does not require considering its encoding meaning. However, the author holds the opposite view. The reasons are (1) semantic information is related to uncertainty, and thus also to encoding; (2) many successful machine learning methods have used cross-entropy to reflect the encoding length and VFE (i.e., posterior cross-entropy H(X|Yθ)) to indicate the lower limit of residual encoding length or reconstruction cost.
The G measure is equal to H(X) − F, reflecting the encoding length saved because of semantic predictions. The encoding length is an objective standard, and a semantic information measure without an objective standard makes it hard to avoid subjectivity and arbitrariness.
6.3. What Is Information? Is the G Measure Compatible with the Daily Information Concept?
Is the G measure, a technical information measure, compatible with the daily information concept? This cannot be ignored.
What is information? This question has many different answers, as summarized by Mark Burgin [
84]. According to Shannon’s definition, information is uncertainty reduced. Shannon information is the uncertainty reduced due to the increase of probability. In contrast, semantic information is the uncertainty reduced due to narrowing concepts’ extensions or improving truth.
From a common-sense perspective, information refers to something previously unknown or uncertain, which encompasses
- (1)
Information from natural language: information provided by answers to interrogative sentences (e.g., sentences with “Who?”, “What?”, “When?”, “Where?”, “Why?”, “How?”, or “Is this?”).
- (2)
Perceptual or observational information: information obtained from material properties or observed representations.
- (3)
Symbolic information: information conveyed by symbols like road signs, traffic lights, and battery polarity symbols [
12].
- (4)
Quantitative indicators’ information: information provided by data, such as time, temperature, rainfall, stock market indices, inflation rates, etc.
- (5)
Associated information: information derived from event associations, such as the rooster’s crowing signaling dawn and a positive medical test indicating disease.
Items 2, 3, and 4 can also be viewed as answers to the questions in item 1, thus providing information. These forms of information involve concept extensions and truth–falsehood and should be semantic information. The associated information in item 5 can be measured using Shannon’s or semantic information formulas. When probability predictions are inaccurate (i.e., P(x|θj) ≠ P(x|yj)), the semantic information formula is more appropriate. Thus, G theory is consistent with the concept of information in everyday life.
In computer science, information is often defined as useful, structured data. What qualifies as “useful”? This utility arises because the data can answer questions or provide associated information. Therefore, the definition of information in data science also ties back to reduced uncertainty and narrowed concept extensions.
6.4. Relationships and Differences Between G Theory and Other Semantic Information Theories
6.4.1. Carnap and Bar-Hillel’s Semantic Information Theory
The semantic information measure of Carnap and Bar-Hillel is [3]
Ip = log(1/mp),
where Ip is the semantic information provided by the proposition set, p, and mp is the logical probability of p. This formula reflects Popper’s idea that smaller logical probabilities convey more information. However, as Popper noted, this idea requires the hypothesis to withstand factual testing. The above formula does not account for such tests, implying that correct and incorrect hypotheses provide the same information.
Additionally, unlike Carnap and Bar-Hillel’s approach, G theory calculates the logical probability from the statistical probability distribution.
6.4.2. Dretske’s Knowledge and Information Theory
Dretske [8] emphasized the relationship between information and knowledge, viewing information as content tied to facts and knowledge acquisition. Though he did not propose a specific formula, his ideas about information quantification include the following:
- The information must correspond to facts and eliminate all other possibilities.
- The amount of information relates to the extent of the uncertainty eliminated.
- Information used to gain knowledge must be true and accurate.
G theory is compatible with these principles by providing the G measure to implement Dretske’s idea mathematically.
6.4.3. Floridi’s Strong Semantic Information Theory
Floridi’s theory [12] elaborated on Dretske’s ideas and introduced a strong semantic information formula. However, this formula is more complex and less effective at reflecting factual testing than the G measure. For instance, Floridi’s approach ensures that tautologies and contradictions yield zero information but fails to penalize false predictions with negative information.
6.4.4. Other Semantic Information Theories
In addition to the semantic information theories mentioned above, other well-known ones include the theory based on fuzzy entropy proposed by Zhong [
10] and the theory based on synonymous mapping proposed by Niu and Zhang [
16]. Zhong advocated for the combination of information science and artificial intelligence, which greatly influenced China’s semantic information theory research. He employed fuzzy entropy [
85] to define the semantic information measure. However, this approach yielded identical maximum values (1 bit) for both true and false sentences [
10], which is not what we expect. Other semantic information measures that use De Luca and Termini’s fuzzy entropy also encounter similar problems.
Other authors who discussed semantic information measures and semantic entropy include D’Alfonso [13], Basu et al. [86], and Melamed [87]. These authors improved semantic information measures by improving Carnap and Bar-Hillel’s logical probability, and their semantic entropies have forms similar to that of the Shannon entropy. The semantic entropy, H(Yθ), in G theory differs from these semantic entropies: it contains both statistical and logical probabilities and reflects the average codeword length of lossless coding (see Section 2.3).
Liu et al. [88] and Guo et al. [89] used the dual-constrained rate distortion function, R(Ds, Dx), which is meaningful. In contrast, R(G) already contains dual constraints because the semantic constraints include distortion and semantic information constraints.
In addition, some fuzzy information theories [90,91] and generalized information theories [92] also involve semantics to varying degrees. However, most of them are far removed from Shannon’s information theory.
6.5. Relationship Between G Theory and Kolmogorov’s Complexity Theory
Kolmogorov [93] defined the complexity of a set of data as the shortest program length required to restore the data without loss:
K(x) = min{|p| : U(p) = x},
where |p| is the length of program p and U(p) is the output of program p. The information provided by knowledge can then be defined as the reduction in complexity due to that knowledge [84].
The information measured by Shannon can be understood as the average information provided by y about x, while Kolmogorov’s information is the information provided by knowledge about the co-occurring data. Shannon’s information theory does not consider the complexity of an individual datum, while Kolmogorov’s theory does not consider statistical averages. It can be said that Kolmogorov defines the amount of information in microdata, while Shannon provides a formula to measure the average information in macrodata. The two theories are complementary.
The semantic information measure, I(X; Yθ), is related to Kolmogorov complexity in the following two aspects:
- Because knowledge includes the extensions of concepts, the logical relationships between concepts, and the correlations between things (including causality), the information Kolmogorov refers to contains semantic information.
- Because VFE represents the reconstruction cost given the prediction, and the G measure equals H(X) minus VFE, the G measure is similar to Kolmogorov’s information measure; both represent the saved reconstruction cost.
Some people have proposed complexity distortion [94], the shortest coding length with an error limit. It is possible to extend complexity distortion to the shortest coding length with the semantic constraint to obtain a function similar to the R(G) function. This direction is worth exploring.
6.6. A Comparison of the MIE Principle and the MFE Principle
Friston proposed the MFE principle, which he believed was a universal principle that bio-organisms use to perceive the world and adapt to the environment (including transforming the environment). The core mathematical method he uses is VB. A better-understood principle in G theory is the MIE principle.
The main differences between the two are as follows:
- G theory regards Shannon’s MI, I(X; Y), as the increment of free energy, while Friston’s theory regards the semantic posterior entropy, H(X|Yθ), as free energy.
- The methods for finding the latent variable, P(y), and the Shannon channel, P(y|x), are different: Friston uses VB, and G theory uses SVB.
When optimizing the prediction model, P(x|θj) (j = 1, 2, …), the two are consistent; when optimizing P(y) and P(y|x), the results of the two are similar, but the methods are different, and SVB is simpler. The reason the results are similar is that VB uses the mean-field approximation when optimizing P(y|x), which is equivalent to using P(y|x) instead of P(y) as the variational object and actually uses the MID criterion.
Figure 18 shows that the information difference, R − G, rather than the VFE, continuously decreases in a mixture model’s convergence process.
In physics, free energy can be used to do work; the more, the better. Why, then, should it be minimized? In physics, there are two situations in which free energy decreases. One is passive reduction because of the increase in entropy; the other is actively reducing the free energy consumed while doing work, out of consideration for thermal efficiency. Reducing the consumed free energy conforms to Jaynes’ maximum entropy principle. Therefore, from a physics perspective, it is not easy to understand why one would actively minimize free energy.
The MIE criterion is like maximizing the working efficiency, W/ΔF*, when using free energy to perform work, W. The MIE principle is therefore easier to understand.
The author will discuss these two principles further in other articles.
6.7. Limitations and Areas That Need Exploration
G theory is still a basic theory; it has limitations in many respects that need improvement.
- (1)
Semantics and the distortion of complex data.
Truth functions can represent the semantic distortion of labels. However, it is difficult to express the semantics, semantic similarity, and semantic distortion of complex data (such as a sentence or an image). Many researchers have made valuable explorations [
14,
15,
17,
95,
96,
97]. The author’s research is insufficient.
The semantic relationship between a word and many other words is also very complex. Innovations like Word2Vec [
98,
99] in deep learning have successfully modeled these relationships, paving the way for advancements like Transformer [
100] and ChatGPT. Future work in G theory should aim to integrate such developments to align with the progress in deep learning.
- (2)
Feature extraction.
The features of images encapsulate most semantic information. There are many efficient feature extraction methods in deep learning, such as Convolutional Neural Networks [
101] and AutoEncoders [
33]. These methods are ahead of G theory. Whether G theory can be combined with these methods to obtain better results needs further exploration.
- (3)
The channel-matching algorithm for neural networks.
Establishing neural networks to enable the mutual alignment of Shannon and semantic channels appears feasible. Current deep learning practices, relying on gradient descent and backpropagation, demand significant computational resources. If the channel-matching algorithm can reduce reliance on these methods, it would save computational power and physical free energy.
- (4)
Neural networks utilizing fuzzy logic.
Using truth functions or their logarithms as network weights facilitates the application of fuzzy logic. The author previously used a fuzzy logic method [
52] compatible with Boolean algebra for setting up a symmetrical color vision mechanism model: the 3–8 decoding model. Combining G theory with fuzzy logic and neural networks holds promise for further exploration.
- (5)
Optimizing economic indicators and forecasts.
Forecasting in weather, economics, and healthcare domains provides valuable semantic information. Traditional evaluation metrics, such as accuracy or the average error, can be enhanced by converting distortion into fidelity. Using truth functions to represent semantic information offers a novel method for evaluating and optimizing forecasts, which merits further exploration.
6.8. Conclusions
The semantic information G theory is a generalization of Shannon’s information theory. Its validity has been supported by its broad applications across multiple domains, particularly in solving problems related to daily semantic communication, electronic semantic communication, machine learning, Bayesian confirmation, constraint control, and investment portfolios with information values.
Particularly, G theory allows us to utilize the existing coding methods for semantic communication. The core idea is to replace distortion constraints with semantic constraints, for which the information rate–distortion function was extended to the information rate–fidelity function. The key methods are (1) to use the P-T probability framework with truth functions and (2) to define the semantic information measure as the negative partition-regularized distortion.
However, G theory’s primary limitation lies in the semantic representation of complex data. In this regard, it has lagged behind the advancements in deep learning. Bridging this gap will require learning from and integrating insights from other studies and technologies.