Article

Factorizable Joint Shift in Multinomial Classification

Independent Researcher, 8032 Zurich, Switzerland
Mach. Learn. Knowl. Extr. 2022, 4(3), 779-802; https://doi.org/10.3390/make4030038
Submission received: 6 August 2022 / Revised: 6 September 2022 / Accepted: 7 September 2022 / Published: 10 September 2022
(This article belongs to the Section Learning)

Abstract

Factorizable joint shift (FJS) was recently proposed as a type of dataset shift for which the complete characteristics can be estimated from feature data observations on the test dataset by a method called Joint Importance Aligning. For the multinomial (multiclass) classification setting, we derive a representation of factorizable joint shift in terms of the source (training) distribution, the target (test) prior class probabilities and the target marginal distribution of the features. On the basis of this result, we propose alternatives to joint importance aligning and, at the same time, point out that factorizable joint shift is not fully identifiable if no class label information on the test dataset is available and no additional assumptions are made. Other results of the paper include correction formulae for the posterior class probabilities both under general dataset shift and factorizable joint shift. In addition, we investigate the consequences of assuming factorizable joint shift for the bias caused by sample selection.

1. Introduction

In machine learning terminology, dataset shift refers to the phenomenon that the joint distribution of features and labels on the training dataset used for learning a model may differ from the related joint distribution on the test dataset to which the model is going to be applied; see Storkey [1] or Moreno-Torres et al. [2] for surveys and background information on dataset shift. Dataset shift can be the consequence of very different causes. For that reason, a catch-all treatment of general dataset shift is difficult if not impossible. As a workaround, a number of specific types of dataset shift have been defined in order to introduce additional assumptions that allow for tailor-made approaches to deal with the problem. The most familiar subtypes of dataset shift are prior probability shift and covariate shift, but more types are introduced on a continuing basis as there is a practice-driven need to do so.
Typically, under dataset shift, the test dataset observations of features are available, but the class labels cannot be observed. In this situation, it is impossible to know ex ante if covariate shift or prior probability shift (or something in between) has occurred. However, estimates of models under assumptions of covariate shift and prior probability shift, respectively, tend to differ conspicuously. As a consequence, additional assumptions need to be made in order to be able to choose between modelling options related to covariate shift and prior probability shift. Such additional assumptions may be phrased in terms of causality (Storkey [1]): if the features can be considered “causing” the class labels, then models designed to deal with covariate shift are appropriate. Otherwise, if the class “causes” features, models targeting prior probability shift should be preferred.
He et al. [3] recently proposed “factorizable joint shift” (FJS) which generalises both prior probability shift and covariate shift. They went on with presenting the “joint importance aligning” method for estimating the characteristics of this type of shift. At first glance, He et al. hence seemed to provide a way to avoid choosing ex ante between covariate shift and prior probability shift models. Instead, “joint importance aligning” (plus some regularisation) appeared to be a method that functioned as a covariate shift model, prior probability shift model, or combined covariate and label shift model, as required by the characteristics of the test dataset.
By a detailed analysis of factorizable joint shift in multinomial classification settings, in this paper we point out that general factorizable joint shift is not fully identifiable if no class label information on the test dataset is available and no additional assumptions are made. This is in contrast to the situations with covariate shift or prior probability shift. Therefore, circumspection is recommended with regard to potential deployment of “joint importance aligning” as proposed by He et al. [3].
He et al. characterised factorizable joint shift by claiming that “the biases coming from the data and the label are statistically independent”. This description might not fully hit the mark. As we demonstrate in this paper, factorizable joint shift has little to do with statistical independence but should rather be interpreted as a structural property similar to the “separation of variables” which plays an important role for finding closed-form solutions to differential equations. We also argue that, in probabilistic terms, factorizable joint shift perhaps is better described as “scaled density ratios” shift.
The plan of this paper and its main research contributions are as follows:
  • Section 2 “Setting the scene” presents the assumptions, concepts and notation for the multinomial (or multiclass) classification setting of this paper.
  • Section 3 “General dataset shift in multinomial classification” introduces a normal form for the joint density of features and class labels (Theorem 1) and derives in Corollary 2 a generalisation of the correction formula for class posterior probabilities of Saerens et al. [4] and Elkan [5].
  • Section 4 “Factorizable joint shift” defines this kind of dataset shift in a mathematically rigorous manner and presents a full representation in terms of the source (training) distribution, the target (test) prior class probabilities and the target marginal distribution of the features (Theorem 2). In addition, a specific version of the posterior correction formula is given (Corollary 4), and the description of factorizable joint shift as “scaled density ratios” shift is motivated. Moreover, alternatives to the “joint importance aligning” of He et al. [3] are proposed (Section 4.1).
  • Section 5 “Common types of dataset shift” examines in a mathematically rigorous manner for a number of types of dataset shift mentioned in the literature if they are implied by or imply factorizable joint shift. The types of dataset shift treated in this section are prior probability shift, covariate shift, covariate shift with posterior drift, domain invariance and generalised label shift. In addition, the posterior correction formulae specific for these types of dataset shift are presented.
  • Section 6 “Sample selection bias” revisits the topic of dataset shift caused by sample selection bias and looks at the question of how the class-wise selection probabilities look like if the induced dataset shift is factorizable joint shift (Theorem 3).
  • Section 7 “Conclusions” provides a short discussion of the important findings of the paper and points to some open research questions.

2. Setting the Scene

In this paper, we use the following population-level description of the multinomial classification problem under dataset shift in terms of measure theory. See standard textbooks on probability theory like Billingsley [6] or Klenke [7] for formal definitions and background of the notions introduced in Assumption 1. See Tasche [8] for a detailed reconciliation of the setting of this paper with the concepts and notation used in the mainstream machine learning literature.
Assumption 1.
(Ω, F) is a measurable space. The source distribution P and the target distribution Q are probability measures on (Ω, F). For some positive integer d ≥ 2, events A_1, …, A_d ∈ F and a sub-σ-algebra H ⊆ F are given. The events A_i, i = 1, …, d, and H have the following properties:
(i) 
⋃_{i=1}^d A_i = Ω.
(ii) 
A_i ∩ A_j = ∅ for i, j = 1, …, d, i ≠ j.
(iii) 
0 < P[A_i], i = 1, …, d.
(iv) 
0 < Q[A_i], i = 1, …, d.
(v) 
A_i ∉ H, i = 1, …, d.
In the literature, P is also called “source domain” or “training distribution”, while Q is also referred to as “target domain” or “test distribution”.
The elements ω of Ω are objects (or instances) with class (label) and covariate (or feature) attributes. ω ∈ A_i means that ω belongs to class i (or the positive class in the binary case if i = 1).
The σ -algebra F of events F F is a collection of subsets F of Ω with the property that they can be assigned probabilities P [ F ] and Q [ F ] in a logically consistent way. In the literature, thanks to their role of reflecting the available information, σ -algebras are sometimes also called “information set” (Holzmann and Eulert [9]). In the following, we use both terms exchangeably.
The sub-σ-algebra H ⊆ F generated by the covariates (features) contains the events which are observable at the time when the class of an object ω has to be predicted. Since A_i ∉ H, i = 1, …, d, the class of the object may not yet be known at that time. In this paper, we assume that under the source distribution P, the class events A_i can be observed such that the prior class probabilities can be estimated. In contrast, under the target distribution Q, the events A_i cannot be directly observed and can only be predicted on the basis of the events H ∈ H, which are assumed to reflect the features of the object.
For technical reasons, it is convenient to define the joint information set H ¯ of features and class labels:
Definition 1.
We denote by A = σ({A_1, …, A_d}) the minimal sub-σ-algebra of F containing all A_i, i = 1, …, d, and by H̄ the minimal sub-σ-algebra of F containing both H and A, i.e., H̄ = σ(H ∪ A).
Note that the σ-algebra A can be represented as
A = { ⋃_{i=1}^d (A_i ∩ F_i) : F_1, …, F_d ∈ {∅, Ω} }, (1a)
while the σ-algebra H̄ can be written as
H̄ = { ⋃_{i=1}^d (A_i ∩ H_i) : H_1, …, H_d ∈ H }. (1b)
A standard assumption in machine learning is that source and target distribution are the same, i.e., P = Q. The situation where P[F] ≠ Q[F] holds for at least one F ∈ H̄ is called dataset shift (Moreno-Torres et al. [2], Definition 1).
Under dataset shift as defined this way, typically, classifiers or posterior class probabilities learnt under the source distribution stop working properly under the target distribution. Finding algorithms to deal with this problem is one of the tasks in the field of domain adaptation.
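To make the setting of Assumption 1 concrete, the following minimal sketch (in Python, with purely hypothetical numbers) encodes a finite toy example: Ω is a grid of (feature value, class) pairs, P and Q are arrays of joint probabilities, and the feature information set H corresponds to knowing only the feature value. The same arrays are reused in later numerical checks.

```python
import numpy as np

# Toy instance of Assumption 1 (hypothetical numbers): Omega consists of pairs
# (feature value x, class i) with x in {0, 1, 2} and i in {1, 2}, so d = 2.
# Rows index feature values, columns index classes; entries are probabilities.
P = np.array([[0.20, 0.10],   # source distribution P on Omega
              [0.15, 0.15],
              [0.05, 0.35]])
Q = np.array([[0.10, 0.05],   # target distribution Q on Omega
              [0.15, 0.10],
              [0.10, 0.50]])
assert np.isclose(P.sum(), 1.0) and np.isclose(Q.sum(), 1.0)

# Prior class probabilities P[A_i] and Q[A_i] (positive, as required by (iii), (iv)).
p_prior = P.sum(axis=0)
q_prior = Q.sum(axis=0)

# The feature information set H corresponds to observing only x, so the source
# posterior class probabilities P[A_i | H] are functions of x alone.
post_P = P / P.sum(axis=1, keepdims=True)
post_Q = Q / Q.sum(axis=1, keepdims=True)
print("P[A_i]:", p_prior, "Q[A_i]:", q_prior)
```

Since P ≠ Q on H̄ in this toy example, dataset shift in the sense of the definition above is present.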
In this paper, we are mostly interested in exploring how posterior class probabilities change between a source and a target distribution as described in Assumption 1. In particular, we provide generalisations of the posterior correction formula (2.4) of Saerens et al. [4] (see also Theorem 2 of Elkan [5]). For this purpose, the notions of conditional expectation and conditional probability are crucial.
In the following, E P denotes conditional or unconditional expectation with respect to the probability measure P. For a given probability space ( Ω , F , P ) , we refer to Section 8.2 of Klenke [7] for the formal definitions and properties of
  • The expectation E P [ X | H ] of a real-valued random variable X conditional on a sub- σ -algebra H ;
  • The probability P[F | H] of an event F ∈ F conditional on H.
In the machine learning literature, often the term posterior class probability rather than conditional probability is used to refer to the conditional probabilities P [ A i | H ] and Q [ A i | H ] , i = 1 , , d , in the context of Assumption 1. In contrast, the term prior probability is used for the probabilities P [ A i ] and Q [ A i ] , which in our measure-theoretic setting should rather be called unconditional probabilities of A i .
An assumption of absolute continuity is also crucial for an investigation of how the posterior class probabilities are impacted by a change from the source distribution to the target distribution. Formally, this assumption reads as follows:
Assumption 2.
Assumption 1 holds, and Q is absolutely continuous with respect to P on H ¯ , i.e.,
Q|_H̄ ≪ P|_H̄,
where M|_H stands for the measure M with domain restricted to H.
The statement “Q is absolutely continuous with respect to P on H̄” means that for all events N ∈ H̄, P[N] = 0 implies Q[N] = 0. Hence, “impossible” events under P are also impossible under Q. Measure-theoretic impossibility is somewhat unintuitive because for continuous distributions each single outcome has probability 0 and therefore is impossible. Nonetheless, sampled values from such distributions are single outcomes and occur despite having probability 0.
However, the statement “for all events N ∈ H̄, P[N] = 0 implies Q[N] = 0” is equivalent to saying: for all events N ∈ H̄, Q[N] > 0 implies P[N] > 0. This means that “possible” events under Q are also possible events under P, even if with very tiny probabilities of occurrence. This phrasing of absolute continuity is more intuitive and is preferred by some authors, for instance by He et al. [3] who in Section 2 make the assumption D_T(x, y) > 0 ⟹ D_S(x, y) > 0, which they seem to understand in the sense of Assumption 2.
As mentioned before, if the target distribution Q is absolutely continuous with respect to P, there may be events whose probabilities under Q are much greater than their probabilities under P. From a practical point of view, such events may even appear to be “impossible” under P. Notions such as “sufficient support” and “support sufficiency divergence” (Johansson et al. [10]) suggest that such is the view of the machine learning community. Hence, Assumption 2 is not necessarily in contrast to the working assumption of partially or fully nonoverlapping source and target domains made by many researchers in unsupervised domain adaptation.
For analyses of the case of domains where the source does not completely cover the target (such that Assumption 2 may be violated), see Johansson et al. [10]. However, the statement of Johansson et al., Section 5, “If this overlap is increased without losing information, such as through collection of additional samples, this is usually preferable.” suggests that an assumption of nonoverlapping support is not the same as an assumption on a lack of absolute continuity. For according to the statement by Johansson et al., events outside of the source support do not appear to be impossible because in that case the “collection of additional samples” could not increase the support overlap between source and target.
Assumption 2 is stronger than the common assumption of absolute continuity on H (see for instance, Scott [11]), but in terms of interpretation there is no big difference: all events possible under the target distribution (including in label space) are also possible under the source distribution.
An important consequence of Assumption 2 is that we can use the source distribution P as a reference measure for the target distribution Q. This is more natural than introducing another measure without real-world meaning as a reference for both P and Q. In addition, renouncing another measure as a reference has the advantageous effect of simplifying notation.
Recall the following common conventions intended to make the measure-theoretic notation more incisive:
Notation 1.
An important consequence of deploying a measure-theoretic framework as in this paper is that real-valued random variables X on a fixed probability space ( Ω , F , P ) are uniquely defined only up to events of probability 0 and may be undefined or ill-defined on such events or when being multiplied with the factor 0. To be more specific:
  • If X′ is another random variable such that P[X ≠ X′] = 0, then E_P[X] exists if and only if E_P[X′] exists. In this case, E_P[X] = E_P[X′] follows.
  • If X is undefined or ill-defined on an event N ∈ F with P[N] = 0, then, by definition, E_P[X] exists if and only if E_P[X′] exists for
    X′ = X on Ω \ N, and X′ = 0 on N.
    In this case, E_P[X] is defined as E_P[X′].
  • If X is undefined or ill-defined on an event F ∈ F but is multiplied with another random variable Z which takes the value 0 on F, then, by definition, E_P[X Z] exists if and only if E_P[X′] exists for
    X′ = X Z on Ω \ F, and X′ = 0 on F.
    In this case, E_P[X Z] is defined as E_P[X′].
The conventions listed in Notation 1 are convenient and used frequently in the following text. Note, however, that they are only valid in the context of a fixed probability measure P. For instance, under Assumption 2, if the event N on which the random variable X is undefined has probability 0 under the source distribution P, i.e., P[N] = 0, then Q[N] = 0 follows as well, such that E_Q[X] is also well-defined. Nonetheless, Q[N] = 0 does not necessarily imply P[N] = 0, such that E_Q[X] might be well-defined despite E_P[X] being ill-defined.
In the same vein, under Assumption 2, for the posterior class probabilities P[A_i | H], i = 1, …, d, the expectations E_Q[P[A_i | H]] are well-defined. However, for the posterior class probabilities Q[A_i | H], i = 1, …, d, the expectations E_P[Q[A_i | H]] are potentially ill-defined because there could be versions of Q[A_i | H] which are indistinguishable under Q but differ with positive probability under P. In the following, we are careful to avoid such issues whenever the discussion involves more than one probability measure.

3. General Dataset Shift in Multinomial Classification

Under Assumption 2, by the Radon–Nikodym theorem, there is an H̄-measurable density h̄ = dQ/dP|_H̄ of the target distribution Q with respect to the source distribution P on the joint information set H̄ defined by (1b). This density links Q to P by Equation (2):
Q[F] = E_P[h̄ 1_F], for all F ∈ H̄. (2)
In (2) and in the remainder of the paper, 1_F denotes the indicator function of F, defined by 1_F(ω) = 1 if ω ∈ F and 1_F(ω) = 0 if ω ∉ F.
Unfortunately, in practice h ¯ is more or less unobservable. Therefore, it is desirable to decompose it into smaller parts which may be observable or can perhaps be determined through reasonable assumptions. The key step to such a decomposition is made with the following combination of definitions and lemma.
Definition 2.
Under Assumption 1, define the following class-conditional distributions, by letting for F ∈ F and i = 1, …, d
P_i[F] = P[F | A_i] = P[A_i ∩ F] / P[A_i]  and  Q_i[F] = Q[F | A_i] = Q[A_i ∩ F] / Q[A_i]. (3)
In the literature, when restricted to the feature information set H , the P i and Q i sometimes are called class-conditional feature distributions.
Lemma 1.
Under Assumption 2, for i = 1 , , d , the class-conditional feature distribution Q i is absolutely continuous with respect to P i on H .
Denote by h_i = dQ_i/dP_i|_H a Radon–Nikodym derivative (or density) of Q_i with respect to P_i. If there is another H-measurable function h_i* ≥ 0 with the density property, i.e., Q_i[H] = E_{P_i}[h_i* 1_H] for all H ∈ H, then it follows that
P[h_i ≠ h_i*, P[A_i | H] > 0] = 0 = P[{h_i ≠ h_i*} ∩ A_i]. (4)
Proof. 
Fix i and choose any N ∈ H̄ with P_i[N] = 0. Then, it follows that N ∩ A_i ∈ H̄ and P[N ∩ A_i] = 0. By Assumption 2, Q[N ∩ A_i] = 0 follows, which implies
Q_i[N] = Q[N ∩ A_i] / Q[A_i] = 0.
Hence, we have Q_i|_H̄ ≪ P_i|_H̄, from which Q_i|_H ≪ P_i|_H follows. The uniqueness of Radon–Nikodym derivatives implies
0 = P_i[h_i ≠ h_i*] = P[{h_i ≠ h_i*} ∩ A_i] / P[A_i],
and hence the right-hand side of (4). However, by the definition of conditional probability it also follows that
0 = P[{h_i ≠ h_i*} ∩ A_i] = E_P[1_{h_i ≠ h_i*} P[A_i | H]].
This implies the left-hand side of (4). □
With Lemma 1 as preparation, we are in a position to state the following key representation result and some corollaries for the joint density h ¯ of features and class labels. In the remainder of this paper, we make use of (5) as a normal form for h ¯ .
Theorem 1.
Under Assumption 2, the density h̄ of Q with respect to P on H̄ can be represented as
h̄ = ∑_{i=1}^d h_i (Q[A_i]/P[A_i]) 1_{A_i}, (5)
where the h_i are any densities of Q_i with respect to P_i on H as introduced in Lemma 1, for i = 1, …, d.
Proof. 
Let F ∈ H̄. By (1b), it then holds that
F = ⋃_{i=1}^d (A_i ∩ H_i) for some H_1, …, H_d ∈ H.
This implies
Q[F] = ∑_{i=1}^d Q[A_i] Q_i[H_i] = ∑_{i=1}^d Q[A_i] E_{P_i}[h_i 1_{H_i}] = ∑_{i=1}^d (Q[A_i]/P[A_i]) E_P[h_i 1_{H_i ∩ A_i}] = E_P[ ∑_{i=1}^d h_i (Q[A_i]/P[A_i]) 1_{A_i} 1_F ].
Equation (5) follows from this by the definition of Radon–Nikodym derivatives. □
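Continuing the toy example of Section 2 (hypothetical numbers), the normal form (5) can be checked numerically; this is only an illustration of the statement of Theorem 1 on a finite Ω.

```python
# Continuation of the toy arrays P, Q, p_prior, q_prior from Section 2.
h_bar = Q / P                                 # pointwise joint density dQ/dP on H_bar

# Class-conditional feature distributions P_i, Q_i and densities h_i = dQ_i/dP_i on H.
P_cond = P / P.sum(axis=0, keepdims=True)     # column i: P_i on the feature space
Q_cond = Q / Q.sum(axis=0, keepdims=True)     # column i: Q_i on the feature space
h_i = Q_cond / P_cond

# Right-hand side of (5): on A_i the joint density equals h_i * Q[A_i] / P[A_i].
assert np.allclose(h_bar, h_i * (q_prior / p_prior))
```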
Corollary 1.
Under Assumption 2, the density h of Q with respect to P on H can be written as
h = ∑_{i=1}^d h_i (Q[A_i]/P[A_i]) P[A_i | H].
Proof. 
The corollary follows from Theorem 1 because h = E P [ h ¯ | H ] . □
Corollary 2.
Under Assumption 2, for i = 1, …, d, the conditional probability (posterior class probability) Q[A_i | H] can be represented as
Q[A_i | H] = h_i (Q[A_i]/P[A_i]) P[A_i | H] / ∑_{j=1}^d h_j (Q[A_j]/P[A_j]) P[A_j | H], (6)
on the set {h > 0}, where h denotes the denominator of the right-hand side of (6) (and is the density of Q with respect to P on H, as introduced in Corollary 1).
Equation (6) generalises Equation (2.4) of Saerens et al. [4] and Theorem 2 of Elkan [5] from prior probability shift to general dataset shift. Saerens et al. commented on their Equation (2.4) as follows: “This well-known formula can be used to compute the corrected a posteriori probabilities, …”. Hence, in this paper we call (6) the posterior correction formula.
Recall that under Assumption 2, it holds that Q [ h > 0 ] = 1 while P [ h > 0 ] < 1 is possible. Hence, Q [ A i | H ] is fully specified by (6) under Q but possibly only incompletely specified under P.
Proof of Corollary 2.
Apply the generalised Bayes formula (see Lemma A1 in Appendix A) with F = H ¯ , f = h ¯ , G = H and X = 1 A i . □
A direct application of the posterior correction formula (6) is not possible because the target prior probabilities Q [ A i ] and the target class conditional feature densities h i typically are unknown. However, in some cases the target priors might be known from external sources such as central banks, IMF or national offices of statistics. Under more specific assumptions on the type of dataset shift, it may be possible to estimate the target priors from the target dataset. See González et al. [12] for a survey of estimation methods under the assumption of prior probability shift.
Under prior probability shift, h_i = 1 is assumed for all i (see Section 5.1 below). This means there is no change of the class-conditional feature distributions. This assumption might be too strong in some situations. It might be more promising to assume similar changes for all classes (i.e., h_i ≈ h_j for i ≠ j), for instance, by assuming factorizable joint shift (see Section 4 below), or by trying to find transformations (or representations) of the features that make the resulting feature densities similar (see Section 5.4 and Section 5.5 below).
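To illustrate how (6) would be used, here is a minimal Python sketch of the posterior correction formula. The function name is hypothetical; the class-conditional densities h_i and the target priors Q[A_i] are assumed to be given, whereas in applications they have to be estimated or fixed by assumption.

```python
import numpy as np

def corrected_posterior(post_source, p_prior, q_prior, h):
    """Posterior correction formula (6).

    post_source : array (n, d), source posteriors P[A_i | H] per instance
    p_prior     : array (d,), source priors P[A_i]
    q_prior     : array (d,), target priors Q[A_i]
    h           : array (n, d), class-conditional densities h_i = dQ_i/dP_i
                  evaluated at the features of each instance
    Returns the target posteriors Q[A_i | H] per instance."""
    num = h * (q_prior / p_prior) * post_source
    return num / num.sum(axis=1, keepdims=True)

# Prior probability shift is the special case h_i = 1, which recovers (16) below.
print(corrected_posterior(np.array([[0.7, 0.3]]), np.array([0.5, 0.5]),
                          np.array([0.2, 0.8]), np.ones((1, 2))))
```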
For the sake of completeness, we also mention the following alternative representation (7b) of h̄ = dQ/dP|_H̄. Compared to (7b), (5) provides more structural information, in particular when taking into account Corollary 2 above, and is therefore potentially more useful.
Corollary 3.
Under Assumption 2, let h be a density of Q with respect to P on H. Then, the target posterior class probabilities Q[A_i | H] vanish on the event {h > 0} if the source posterior class probabilities P[A_i | H] vanish on {h > 0}, i.e., it holds on {h > 0} that
P[A_i | H] = 0 ⟹ Q[A_i | H] = 0. (7a)
Moreover, the density h̄ of Q with respect to P on H̄ can be represented as
h̄ = h ∑_{i=1}^d (Q[A_i | H] / P[A_i | H]) 1_{A_i}. (7b)
Proof. 
Equation (7a) follows immediately from Corollary 2. Taking into account Notation 1 for the meaning of (7b) on the event { P [ A i | H ] = 0 } , the equation follows from (1b) and the definition of the posterior class probabilities. □
The following result may be considered an inversion of the previous results and in particular Corollary 2 on the relationship between source and target distributions. It is of interest mostly for dealing with sample selection bias (see Section 6 below).
Proposition 1.
In the setting of Theorem 1, assume additionally that P[h̄ = 0] = 0 holds. Then, the following statements hold true:
(i) 
P is absolutely continuous with respect to Q on H̄, with dP/dQ|_H̄ = 1/h̄.
(ii) 
For i = 1, …, d, the source class-conditional feature distribution P_i is absolutely continuous with respect to Q_i on H, with Q_i[h_i = 0] = 0 = P_i[h_i = 0] and
dP_i/dQ_i|_H = 1/h_i.
(iii) 
The density dP/dQ|_H̄ can also be represented as
dP/dQ|_H̄ = ∑_{i=1}^d (1/h_i) (P[A_i]/Q[A_i]) 1_{A_i}.
(iv) 
The density dP/dQ|_H can be represented as
dP/dQ|_H = ∑_{i=1}^d (1/h_i) (P[A_i]/Q[A_i]) Q[A_i | H].
(v) 
For i = 1, …, d, it holds that
P[A_i | H] = (1/h_i) Q[A_i | H] (P[A_i]/Q[A_i]) / ∑_{j=1}^d (1/h_j) Q[A_j | H] (P[A_j]/Q[A_j]).
Proof. 
(i) is a well-known property of equivalent probability measures (see Problem 32.6 of Billingsley [6]).
By (i), P is absolutely continuous with respect to Q on H ¯ . This implies that P i is absolutely continuous with respect to Q i on H and, again by Problem 32.6 of [6], the rest of (ii) follows as well.
Properties (iii), (iv) and (v) follow from (i) and (ii), by making use of Theorem 1 and Corollaries 1 and 2 with swapped roles of P and Q. □

4. Factorizable Joint Shift

The following definition translates Definition 2.2 of He et al. [3] into the setting of this paper.
Definition 3.
Under Assumption 2, we say that the target distribution Q is related to the source distribution P by factorizable joint shift (FJS) if there are a non-negative H-measurable function g and a non-negative A-measurable function b such that the density h̄ of Q with respect to P on H̄ can be represented as
h̄ = g b. (8a)
Observe that the functions g and b of Definition 3 are not uniquely determined because for any c > 0 the functions g_c = c g and b_c = b/c are also H-measurable and A-measurable, respectively, and satisfy
h̄ = g_c b_c. (8b)
In the remainder of this section, we show that the functions g and b depend on the source distribution P as well as the marginal distributions of Q on H and A , respectively, but not on the joint distribution Q | H ¯ . For the case d = 2 , in Section 4.2 below we obtain the stronger result that
  • g and b are uniquely determined (up to the ambiguity expressed by (8b)) by the marginal distributions of Q on H and A and the source distribution P;
  • With fixed source distribution P, for each pair of marginal distributions of Q on H and A , there exists (up to a constant factor) a factorization (8a).
Theorem 2.
Under Assumption 2, let the source distribution P and the target distribution Q be related by factorizable joint shift in the sense of Definition 3. Denote by h the density of Q with respect to P on H and let q_i = Q[A_i] and p_i = P[A_i], i = 1, …, d.
Then, up to a constant factor c as in (8b), it follows that
b = ∑_{i=1}^{d−1} ϱ_i (q_i/p_i) 1_{A_i} + (q_d/p_d) 1_{A_d}  and (9a)
g = h / ( ∑_{i=1}^{d−1} ϱ_i (q_i/p_i) P[A_i | H] + (q_d/p_d) P[A_d | H] ), (9b)
where the constants ϱ_1, …, ϱ_{d−1} are positive and finite and satisfy the following equation system:
p_j = ϱ_j E_P[ h P[A_j | H] / ( ∑_{i=1}^{d−1} ϱ_i (q_i/p_i) P[A_i | H] + (q_d/p_d) P[A_d | H] ) ], j = 1, …, d−1. (9c)
Conversely, let an H-measurable function h ≥ 0 with E_P[h] = 1 and (q_i)_{i=1,…,d} ∈ (0, 1)^d with ∑_{i=1}^d q_i = 1 be given. If ϱ_1 > 0, …, ϱ_{d−1} > 0 are solutions of the equation system (9c) and b and g are defined by (9a) and (9b), respectively, then g b is a density of a probability measure Q with respect to P on H̄ such that h is the marginal density of Q with respect to P on H and Q[A_i] = q_i holds for i = 1, …, d.
Proof. 
First, we show that (9a)–(9c) are necessary if Q and P are related by factorizable joint shift as in (8a).
Since b is A-measurable by assumption, there are constants β_1, …, β_d ∈ ℝ such that
b = ∑_{i=1}^d β_i 1_{A_i}. (10a)
For fixed k ∈ {1, …, d}, this implies
q_k = E_P[g b 1_{A_k}] = β_k E_P[g 1_{A_k}] = β_k p_k E_{P_k}[g].
By Assumption 2, we have q_k > 0 and p_k > 0. Hence, it follows that E_{P_k}[g] > 0 and
β_k = q_k / (p_k E_{P_k}[g]) > 0. (10b)
As g b is by assumption an H̄-density of Q with respect to P, it follows that
h = E_P[g b | H] = g ∑_{i=1}^d (q_i / (p_i E_{P_i}[g])) P[A_i | H].
The relations 1 = ∑_{i=1}^d P[A_i | H] and (10b) imply
∑_{i=1}^d (q_i / (p_i E_{P_i}[g])) P[A_i | H] > 0.
Therefore, we obtain
g = h / ∑_{i=1}^d (q_i / (p_i E_{P_i}[g])) P[A_i | H]. (11)
For k ∈ {1, …, d−1}, (11) implies
E_{P_k}[g] = E_P[g P[A_k | H]] / p_k = (1/p_k) E_P[ h P[A_k | H] / ∑_{i=1}^d (q_i / (p_i E_{P_i}[g])) P[A_i | H] ],
and, equivalently,
p_k = (E_{P_d}[g] / E_{P_k}[g]) E_P[ h P[A_k | H] / ( ∑_{i=1}^{d−1} (E_{P_d}[g] / E_{P_i}[g]) (q_i/p_i) P[A_i | H] + (q_d/p_d) P[A_d | H] ) ].
With ϱ_k = E_{P_d}[g] / E_{P_k}[g] > 0, this implies (9c). Equations (9a) and (9b) follow from multiplying (10a) with E_{P_d}[g] and (11) with 1/E_{P_d}[g], respectively.
The converse statement follows from the following observations:
  • With b and g as in (9a) and (9b), E P [ g b ] = 1 holds such that g b is an H ¯ -measurable density with respect to P.
  • Furthermore, E P [ g b | H ] = h holds such that h is the marginal density of g b on H with respect to P.
  • For j ∈ {1, …, d−1}, (9c) is actually equivalent to
    Q[A_j] = E_P[g b 1_{A_j}] = q_j.
Finally, q_d = Q[A_d] is implied by ∑_{i=1}^d q_i = 1. □
Thanks to Theorem 2, the following version of the posterior correction formula (6) can be given for factorizable joint shift.
Corollary 4.
Under Assumption 2, let the source distribution P and the target distribution Q be related by factorizable joint shift in the sense of Definition 3. Denote by h the density of Q with respect to P on H. Then, the target posterior probabilities Q[A_j | H], j = 1, …, d, can be represented as functions of the source posterior probabilities P[A_j | H], j = 1, …, d, in the following way on the event {h > 0}:
Q[A_j | H] = ϱ_j (Q[A_j]/P[A_j]) P[A_j | H] / ( ∑_{i=1}^{d−1} ϱ_i (Q[A_i]/P[A_i]) P[A_i | H] + (Q[A_d]/P[A_d]) P[A_d | H] ), j = 1, …, d−1,
Q[A_d | H] = (Q[A_d]/P[A_d]) P[A_d | H] / ( ∑_{i=1}^{d−1} ϱ_i (Q[A_i]/P[A_i]) P[A_i | H] + (Q[A_d]/P[A_d]) P[A_d | H] ), (12)
where the positive constants ϱ_1, …, ϱ_{d−1} satisfy the equation system (9c).
Proof. 
Apply the generalised Bayes formula (Lemma A1 in Appendix A) for G = H , X = 1 A j and f = g b , with g and b specified by (9b) and (9a), respectively. □
Remark 1.
Assuming P[A_d | H] > 0, (12) implies
(Q[A_j | H] / Q[A_d | H]) (Q[A_d] / Q[A_j]) = ϱ_j (P[A_j | H] / P[A_d | H]) (P[A_d] / P[A_j]), j = 1, …, d−1. (13)
Recall that P[A_k | H] / P[A_k] is the density with respect to P of the class-conditional feature distribution P_k, as defined by (3), on the feature information set H. Similarly, Q[A_k | H] / Q[A_k] is the density with respect to Q of the class-conditional feature distribution Q_k on H. Therefore, (13) states that under factorizable joint shift, the ratios of the class-conditional feature densities are invariant up to a constant factor.
Remark 1 suggests that factorizable joint shift could also be called scaled density ratios shift. This term would emphasise a probabilistic interpretation of this kind of dataset shift, in contrast to “factorizable joint shift” with its focus on the technical aspect of separation of input and output variables.

4.1. Alternatives to Joint Importance Aligning

He et al. [3] proposed in Section 3 the “joint importance aligning” method for estimating a factorized version of the ratio of source and target domain densities which they called “joint importance weight”. He et al. presented a “supervised” and an “unsupervised” version of their method. The “unsupervised” version was intended for the case where no class labels were observed in the target domain, i.e., the case considered primarily in this paper.
Regarding the performance of the “unsupervised” version of their proposal, He et al. indicated that the proposed method tended to present simple covariate shift (see Section 5.2 below) as a solution. This does not come as a surprise because He et al. [3] stated “… in unsupervised objective, we define Ṽ(x) ≔ E_{y ∼ D_S(y|x)} V(y) ⋯”, which suggests that the authors implicitly assumed D_S(y | x) = D_T(y | x), i.e., covariate shift. Without providing an explanation, He et al. proposed a discretisation of the data (covariate) space in order to prevent the algorithm from converging to covariate shift as solution.
Given these qualms about “joint importance aligning”, it might be useful to point out alternative approaches to finding the factorization (8a), based on Theorem 2. The theorem suggests two obvious ways to learn the characteristics of factorizable joint shift:
(a)
If the target prior class probabilities Q [ A i ] are known (for instance from external sources), solve (9c) for the constants ϱ i .
(b)
If the target prior class probabilities Q [ A i ] are unknown, fix values for the constants ϱ i and solve (9c) for the Q [ A i ] . Letting ϱ i = 1 for all i is a natural choice that converts (9c) into the system of maximum likelihood equations for the Q [ A i ] under the prior probability shift assumption.
See Section 4.2.4 of Tasche [13] for an example of approach (a) from the area of credit risk. Regarding the interpretation of (9c) in approach (b) as maximum likelihood equations, see Du Plessis and Sugiyama [14] or Tasche [15]. This interpretation, in particular, implies that an EM (expectation maximisation) algorithm can be deployed for solving the equation system (Saerens et al. [4]).
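Both approaches can be sketched numerically. Because the integrand in (9c) is H-measurable and h = dQ/dP on H, the expectation E_P[h · (…)] equals E_Q[(…)] and can therefore be approximated by a sample mean over target-domain feature observations on which the source posteriors P[A_i | H] have been evaluated. The following Python fragment (hypothetical function names; fixed-point iterations whose convergence is not guaranteed in general) sketches approach (a) and the EM-type iteration of approach (b):

```python
import numpy as np

def solve_rho(post_target, p_prior, q_prior, n_iter=500):
    """Approach (a): target priors q_i known; solve (9c) for rho_1, ..., rho_{d-1}
    by fixed-point iteration. post_target holds the source posteriors P[A_i | H]
    evaluated on a feature sample drawn from the target distribution, so that
    sample means approximate E_P[h * (...)] = E_Q[(...)]."""
    d = p_prior.size
    rho = np.ones(d)                        # rho_d = 1 fixes the constant factor
    for _ in range(n_iter):
        denom = post_target @ (rho * q_prior / p_prior)
        for j in range(d - 1):
            rho[j] = p_prior[j] / np.mean(post_target[:, j] / denom)
    return rho

def solve_q(post_target, p_prior, n_iter=500):
    """Approach (b): rho_i = 1 for all i; (9c) then amounts to the maximum
    likelihood / EM equations for the target priors under prior probability
    shift (Saerens et al. [4])."""
    q = p_prior.copy()
    for _ in range(n_iter):
        num = post_target * (q / p_prior)
        q = np.mean(num / num.sum(axis=1, keepdims=True), axis=0)
    return q
```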

4.2. The Binary Case

Theorem 2 does not provide sufficient or necessary conditions for the existence or uniqueness of solutions to equation system (9c) if a density h and a candidate class distribution (q_i)_{i=1,…,d} are given. In the special case d = 2, such an existence and uniqueness statement can be made, as the following proposition shows. It generalises Section 4.2.4 of Tasche [13].
Proposition 2.
Let (Ω, F, P) be a probability space, H ⊆ F a sub-σ-algebra of F and A ∈ F \ H with 0 < p = P[A] < 1. Assume that P[P[A | H] ∈ {0, 1}] = 0.
Then, there exists a solution ϱ = ϱ_1 > 0 to (9c) with A_1 = A, A_2 = Ω \ A, p_1 = p = 1 − p_2 and q_1 = q = 1 − q_2, if an H-measurable function h: Ω → [0, ∞) with E_P[h] = 1 and a number 0 < q < 1 are given.
Assume additionally that H and A are not independent under P. Then, the solution ϱ to (9c) is unique. Denote by ϕ: (0, 1) → (0, ∞) the function that maps, for a fixed density h, the number 0 < q < 1 to ϱ, i.e., ϕ(q) = ϱ. Then, ϕ has the following properties:
(i) 
ϕ is strictly increasing and continuous on (0, 1).
(ii) 
lim_{q→0} ϕ(q) = P[A] / ( (1 − P[A]) E_P[ h P[A | H] / (1 − P[A | H]) ] ).
(iii) 
lim_{q→1} ϕ(q) = (P[A] / (1 − P[A])) E_P[ h (1 − P[A | H]) / P[A | H] ].
See Appendix B for a proof of Proposition 2. The uniqueness statement of Proposition 2 is interesting because it implies an answer to the question of whether proper concept shift (dataset shift where the marginal distributions of the features and labels, respectively, remain unchanged) can be modelled as factorizable joint shift. The answer—at least for the binary case—is no, because “no shift” then provides the only solution to Equation (9c).
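In the binary case, the map ϕ of Proposition 2 can be approximated by one-dimensional root finding. The sketch below (hypothetical function name; expectations replaced by sample means over a source feature sample on which h and P[A | H] have been evaluated) exploits that the left-hand side of the equivalent Equation (A1b) in Appendix B is decreasing in ϱ, so plain bisection suffices:

```python
import numpy as np

def phi_of_q(q, h, post_source, p):
    """Approximate rho = phi(q) from Proposition 2 by bisection.

    q           : assumed target prior Q[A] in (0, 1)
    h           : array, density dQ/dP on H evaluated on a source feature sample
    post_source : array, source posteriors P[A | H] on the same sample
                  (assumed to avoid the values 0 and 1, cf. Proposition 2)
    p           : source prior P[A]"""
    R1 = post_source / p
    R2 = (1.0 - post_source) / (1.0 - p)
    # g is decreasing in rho with g(0) = 1/(1-q) > 1 and g(inf) = 0, so
    # g(rho) = 1 has a unique root (cf. Appendix B).
    g = lambda rho: np.mean(h * R2 / (rho * q * R1 + (1.0 - q) * R2))
    lo, hi = 0.0, 1.0
    while g(hi) > 1.0:           # enlarge the bracket until the root is enclosed
        hi *= 2.0
    for _ in range(100):         # plain bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 1.0 else (lo, mid)
    return 0.5 * (lo + hi)
```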

5. Common Types of Dataset Shift

In this section, we revisit some popular special cases of dataset shift. In each case, we discuss the question if factorizable joint shift is implied or if the special type of shift is implied by factorizable joint shift. In addition, we provide in each case an adapted version of the posterior correction formula (6).

5.1. Prior Probability Shift

Moreno-Torres et al. [2] defined prior probability shift as invariance of the class-conditional feature distributions between source and target, i.e.,
Q_i[H] = P_i[H], H ∈ H, i = 1, …, d, (14a)
with Q_i and P_i defined as in (3) above, and Q[A_i] ≠ P[A_i] for at least one i. This type of dataset shift is also known as “target shift” [16], “global drift” [17], “label shift” [18] and under other names. In terms of the notation used in Theorem 1, (14a) is equivalent to having the densities of the Q_i with respect to the P_i on the feature information set H equal to 1, i.e.,
h_i = 1, i = 1, …, d. (14b)
By Theorem 1, (14b) implies for the density h̄ of Q with respect to P on H̄ that
h̄ = ∑_{i=1}^d (Q[A_i]/P[A_i]) 1_{A_i} = ( ∑_{i=1}^d Q[A_i] 1_{A_i} ) / ( ∑_{i=1}^d P[A_i] 1_{A_i} ), (15)
which obviously is an A-measurable function. Definition 3 of factorizable joint shift, therefore, is satisfied, as stated by He et al. [3] in Table 1.
The posterior correction formula (6) in this case takes the well-known form
Q[A_i | H] = (Q[A_i]/P[A_i]) P[A_i | H] / ∑_{j=1}^d (Q[A_j]/P[A_j]) P[A_j | H], (16)
as noted before, e.g., by Saerens et al. [4] and Elkan [5].

5.2. Covariate Shift

Moreno-Torres et al. [2] defined covariate shift as invariance of the posterior class probabilities between source and target, i.e.,
Q[A_i | H] = P[A_i | H], i = 1, …, d, (17)
and Q[H] ≠ P[H] for at least one H ∈ H.
Proposition 3.
Under Assumption 2, denote by h ¯ and h, as in Section 3, the densities of Q with respect to P on H ¯ and H , respectively. Then, (17) holds true if and only if h is also a density of Q with respect to P on H ¯ , i.e., P [ h ¯ = h ] = 1 .
Proof. 
The “if” part of the assertion is Lemma 1 of Tasche [8]. Taking into account Notation 1, the “only if” is implied by Corollary 3. □
Proposition 3 implies that covariate shift is a special case of factorizable joint shift in the sense of Definition 3, with b = 1 and g = h , as noted in Table 1 of He et al. [3].
Then, observe that the fact that b is constant implies by (9a) that
ϱ_i = (Q[A_d]/P[A_d]) (P[A_i]/Q[A_i]), for all i = 1, …, d−1. (18)
It can readily be checked that under the assumption of covariate shift the ϱ i defined by (18) indeed solve equation system (9c).
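As a quick worked check (a sketch in LaTeX notation, writing q_i = Q[A_i] and p_i = P[A_i] as in Theorem 2), plugging the constants (18) into (9c) under covariate shift gives:

```latex
% With \varrho_i = \frac{q_d}{p_d}\,\frac{p_i}{q_i}, the denominator in (9c)
% collapses to a constant:
\sum_{i=1}^{d-1} \varrho_i \frac{q_i}{p_i}\, P[A_i \mid \mathcal{H}]
  + \frac{q_d}{p_d}\, P[A_d \mid \mathcal{H}]
  = \frac{q_d}{p_d} \sum_{i=1}^{d} P[A_i \mid \mathcal{H}]
  = \frac{q_d}{p_d} .
% Under covariate shift, E_P\!\left[h\,P[A_j \mid \mathcal{H}]\right]
%   = E_Q\!\left[P[A_j \mid \mathcal{H}]\right]
%   = E_Q\!\left[Q[A_j \mid \mathcal{H}]\right] = q_j,
% so the right-hand side of (9c) equals
\varrho_j\, \frac{p_d}{q_d}\, E_P\!\left[h\, P[A_j \mid \mathcal{H}]\right]
  = \frac{q_d}{p_d}\,\frac{p_j}{q_j}\cdot\frac{p_d}{q_d}\cdot q_j
  = p_j ,
% which is exactly (9c).
```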

5.3. Covariate Shift with Posterior Drift

Scott [11] defined covariate shift with posterior drift (CSPD) for the binary special case (d = 2) of Assumption 1 as the following variant of (17):
there exists a strictly increasing function φ such that
Q[A_1 | H] = φ(P[A_1 | H]). (19)
Equation (19) implies that Q [ A 1 | H ] and P [ A 1 | H ] are strongly comonotonic. As shown in Tasche [19], the converse implication also holds true.
Note that from (19), it also follows that
Q[A_2 | H] = 1 − φ(1 − P[A_2 | H]).
Hence, the increasing link between the posterior positive class probabilities defining CSPD does not only apply to class A 1 but automatically also to the negative class A 2 .
CSPD is implied by factorizable joint shift. This follows from (12) because of
Q[A_2 | H] = φ*(P[A_2 | H]),
with φ*(x) = (Q[A_2]/P[A_2]) x / ( ϱ_1 (Q[A_1]/P[A_1]) (1 − x) + (Q[A_2]/P[A_2]) x ), which is strictly increasing in x.
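For completeness, a short verification (a sketch in LaTeX notation) that φ* is strictly increasing; write a = Q[A_2]/P[A_2] > 0 and b = ϱ_1 Q[A_1]/P[A_1] > 0:

```latex
\varphi^*(x) = \frac{a\,x}{b\,(1-x) + a\,x},
\qquad
(\varphi^*)'(x)
  = \frac{a\,\bigl(b\,(1-x)+a\,x\bigr) - a\,x\,(a-b)}{\bigl(b\,(1-x)+a\,x\bigr)^{2}}
  = \frac{a\,b}{\bigl(b\,(1-x)+a\,x\bigr)^{2}} > 0
\quad \text{for } 0 \le x \le 1 .
```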
Under CSPD, the class-conditional densities h_i = dQ_i/dP_i|_H, i = 1, 2, introduced in Lemma 1 can be shown to be
h_1 = (P[A_1]/Q[A_1]) h φ(P[A_1 | H]) / P[A_1 | H]  and  h_2 = ((1 − P[A_1])/(1 − Q[A_1])) h (1 − φ(P[A_1 | H])) / (1 − P[A_1 | H]), (20)
where h is the density of Q with respect to P on the feature information set H. Alas, when used in connection with Theorem 1, (20) does not provide a very useful representation of h̄.

5.4. Domain Invariance

Translated into the concepts and notation of this paper, domain invariance (see Table 1 of He et al. [3]) is defined as follows:
  • There is an H-measurable mapping (transformation) T into some measurable space with the property that
    Q[M] = P[M] for all M ∈ σ(A ∪ G), (21a)
    where G = σ(T) denotes the smallest sub-σ-algebra of H such that T is still G-measurable.
  • For all i = 1, …, d, it holds that
    P[A_i | H] = P[A_i | G] and Q[A_i | H] = Q[A_i | G]. (21b)
Property (21b) means that T is sufficient for H under both P and Q in the sense of Section 32.3 of Devroye et al. [20].
As mentioned in He et al. [3], (21a) implies covariate shift with respect to G, i.e.,
Q[A_i | G] = P[A_i | G], i = 1, …, d. (21c)
From (21b) then follows covariate shift with respect to H .
Actually, this reasoning shows that in the definition of domain invariance according to He et al. [3], (21a) could be replaced by the weaker assumption (21c), without losing the consequence that covariate shift holds on the whole information set H .

5.5. Generalised Label Shift

Tachet des Combes et al. [21] defined generalised label shift (GLS) as follows: there is an H-measurable mapping (transformation) T into some measurable space with the property that
Q[G | A_i] = P[G | A_i], i = 1, …, d, G ∈ G = σ(T). (22)
Since σ(T) ⊆ H holds, this is weaker than requiring (14a) as for prior probability shift. In this sense, GLS generalises prior probability shift.
He et al. [3] gave in Table 1 a narrower definition of GLS, by requiring in addition to (22) also (21b), and went on to prove that GLS implied factorizable joint shift. We provide an alternative proof of this result, providing mathematically rigorous meaning for the factorisation proposed by He et al.
Proposition 4.
Under Assumption 2, let there be an H-measurable mapping T into some measurable space such that (22) and (21b) hold. Denote by h the density of the target distribution Q with respect to the source distribution P on H. Then, Q and P are related by factorizable joint shift in the sense of Definition 3, with
b = ∑_{i=1}^d (Q[A_i]/P[A_i]) 1_{A_i}  and  g = h / ∑_{i=1}^d (Q[A_i]/P[A_i]) P[A_i | H]. (23)
See Appendix B for a proof of Proposition 4. Observe that Proposition 4 and Corollary 4 together imply that the same class posterior correction formula (16) applies for generalised label shift and prior probability shift.
The factorisation presented in (23) of Proposition 4 corresponds to the factorisation of generalised label shift proposed by He et al. [3] in Table 1 in the following way:
  • Function b matches D_T(Y) of He et al. Because in this paper the reference measure is the source distribution P, D_S(Y) of He et al. corresponds to the constant 1 in (23).
  • Function h matches D_T(X) and function γ (the denominator of g) matches D_T(Z) of He et al. Due to our reference measure being P, D_S(X) and D_S(Z) of He et al. are both matched by the constant 1. The term D_T(X = x | Z = g(x)) appears in (23) as the density ratio g = h/γ, hence with a well-defined mathematical meaning.
Remark 2.
Proposition 4 combined with Remark 1 shows that “generalized label shift” in the sense of He et al. [3] is the same type of dataset shift that was discussed as “invariant density ratio”-type dataset shift in Tasche [22].

6. Sample Selection Bias

Sample selection bias is an important cause of dataset shift. In this subsection, we revisit parts of Hein [23] in order to illustrate some of the concepts and results presented before. We basically work under Assumption 1 but without the interpretation of P as source and Q as target distribution. Instead, P is interpreted as the distribution of a population from which a potentially biased random sample is taken, resulting in the distribution Q. When studying sample selection bias in this setting, the goal is to infer properties of P from properties of the sample distribution Q.
The following assumption describes the setting of this section. The idea is that under the population distribution, each object has a positive chance to be selected. This chance may depend upon the features (covariates) and the class of the object.
Assumption 3
(Sample selection). (Ω, F) is a measurable space. The population distribution P is a probability measure on (Ω, F). For some positive integer d ≥ 2, events A_1, …, A_d ∈ F and a sub-σ-algebra H ⊆ F are given. The events A_i, i = 1, …, d, and H have the following properties:
(i) 
⋃_{i=1}^d A_i = Ω.
(ii) 
A_i ∩ A_j = ∅ for i, j = 1, …, d, i ≠ j.
(iii) 
0 < P[A_i], i = 1, …, d.
(iv) 
A_i ∉ H, i = 1, …, d.
The selection probability is an H̄-measurable random variable φ with 0 < φ ≤ 1, where the sub-σ-algebra H̄ is defined as in (1b).
The probability space ( Ω , F , P ) also supports a random variable U which is uniformly distributed on [ 0 , 1 ] such that U and H ¯ are independent.
Definition 4
(Sample distribution). Under Assumption 3, define the event of being selected by S = {U ≤ φ}. The probability measure Q on (Ω, F), defined by
Q[F] = P[F | S] = P[F ∩ S] / P[S], for F ∈ F,
is called the sample distribution.
Note that the measure Q is well-defined because from the independence of U and H̄, it follows that
P[S] = E_P[ ∫_0^1 1_{[0,φ]}(u) du ] = E_P[φ] > 0.
Another consequence of the independence of U and H̄ is
P[S | H̄] = P[U ≤ φ | H̄] = φ > 0.
Proposition 5.
P and Q as described in Assumption 3 and Definition 4 satisfy Assumptions 1 and 2 with P as source distribution and Q as target distribution. Moreover, P is absolutely continuous with respect to Q on H ¯ .
Proof. 
It remains to show that
  • Q is absolutely continuous with respect to P on H̄, with density h̄ = P[S | H̄] / P[S];
  • P is absolutely continuous with respect to Q on H ¯ ;
  • 0 < Q [ A i ] for i = 1 , , d .
By definition of Q as P conditioned on S, the sample distribution Q is absolutely continuous with respect to P on F and hence also on H̄ ⊆ F. For the density h̄, we obtain
h̄ = E_P[ 1_S / P[S] | H̄ ] = P[S | H̄] / P[S] > 0.
The fact that h̄ is positive implies that P is absolutely continuous with respect to Q on H̄. Since A_i ∈ H̄ for i = 1, …, d, the absolute continuity of P with respect to Q implies Q[A_i] > 0, i = 1, …, d. □

6.1. Properties of the Sample Selection Model

Equation (25) implies for the density h of Q with respect to P on H that
h = E_P[h̄ | H] = P[S | H] / P[S] > 0.
From representation (1b) of H̄, the following alternative description of P[S | H̄] follows:
P[S | H̄] = ∑_{i=1}^d (P[A_i ∩ S | H] / P[A_i | H]) 1_{A_i} = ∑_{i=1}^d P_i[S | H] 1_{A_i},
where the P i denote the class-conditional feature distributions under P, see Definition 2. P i [ S | H ] is accordingly the feature-conditional probability of being selected on the subpopulation of objects with class A i .
For i = 1, …, d and H ∈ H, a short calculation shows:
Q_i[H] = P[A_i ∩ H ∩ S] / P[A_i ∩ S] = E_P[1_H P[A_i ∩ S | H]] / P[A_i ∩ S] = E_P[1_H P[A_i | H] P_i[S | H]] / (P[A_i] P_i[S]) = E_P[1_{H ∩ A_i} P_i[S | H]] / (P[A_i] P_i[S]) = E_{P_i}[ 1_H P_i[S | H] / P_i[S] ].
This implies
h_i = dQ_i/dP_i|_H = P_i[S | H] / P_i[S], i = 1, …, d. (28)
Equation (28) and Theorem 1 together imply the following alternative representation of h̄:
h̄ = ∑_{i=1}^d (P_i[S | H] / P_i[S]) (Q[A_i] / P[A_i]) 1_{A_i}.
By the generalised Bayes formula (Lemma A1 in Appendix A), (25) implies the following representation of the posterior class probabilities Q[A_i | H], i = 1, …, d, under Q:
Q[A_i | H] = E_P[1_{A_i} h̄ | H] / E_P[h̄ | H] = E_P[1_{A_i} P[S | H̄] | H] / P[S | H] = P[S ∩ A_i | H] / P[S | H]. (30)
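The sample selection mechanism can also be illustrated by simulation. The following sketch (hypothetical numbers and selection probabilities) draws a large sample from a discrete population distribution, applies selection with a probability φ that depends on both feature and class, and compares the empirical sample distribution against the density h = P[S | H]/P[S] and the posterior relation (30):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy population distribution P (hypothetical numbers): feature x in {0, 1, 2},
# class i in {1, 2}; rows index features, columns index classes.
P_pop = np.array([[0.20, 0.10],
                  [0.15, 0.15],
                  [0.05, 0.35]])
# Selection probability phi depends on both feature and class (0 < phi <= 1).
phi = np.array([[0.9, 0.5],
                [0.7, 0.4],
                [0.6, 0.2]])

n = 1_000_000
flat = rng.choice(P_pop.size, size=n, p=P_pop.ravel())    # draw (x, i) pairs from P
x, y = np.unravel_index(flat, P_pop.shape)
selected = rng.uniform(size=n) <= phi[x, y]                # the event S = {U <= phi}

# Empirical sample distribution Q = P[. | S].
Q_emp = np.zeros_like(P_pop)
np.add.at(Q_emp, (x[selected], y[selected]), 1.0)
Q_emp /= Q_emp.sum()

# Density h = P[S | H] / P[S] of the sample feature distribution w.r.t. P.
h_theory = (P_pop * phi).sum(axis=1) / ((P_pop * phi).sum() * P_pop.sum(axis=1))
h_emp = Q_emp.sum(axis=1) / P_pop.sum(axis=1)
print(np.round(h_theory, 3), np.round(h_emp, 3))           # should roughly agree

# Posterior relation (30): Q[A_i | H] = P[S and A_i | H] / P[S | H].
post_Q_theory = (P_pop * phi) / (P_pop * phi).sum(axis=1, keepdims=True)
post_Q_emp = Q_emp / Q_emp.sum(axis=1, keepdims=True)
print(np.round(post_Q_theory, 3), np.round(post_Q_emp, 3))
```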
Zadrozny [24] and Hein [23] observed that if the event S of being selected and the class labels as expressed by the σ -algebra A were independent conditional on H , the information set reflecting the features, then the population distribution P and the sample distribution Q were related by covariate shift. A consequence of (30) is that the converse of this observation actually also holds true, as stated in the following proposition.
Proposition 6.
In the sample selection model, as specified by Assumption 3 and Definition 4, the population distribution P and the sample distribution Q are related by covariate shift if and only if
P[S ∩ A_i | H] = P[S | H] P[A_i | H], i = 1, …, d,
i.e., if the event of being selected and the class labels are independent conditional on the features under the population distribution P.
Proof. 
Proposition 6 is obvious from (30) and the definition of covariate shift (17). □
In the case of general dataset shift caused by sample selection, Equation (30) does not provide information about how to compute the population posterior class probabilities P [ A i | H ] from the sample posterior class probabilities Q [ A i | H ] . Translated into the setting of this paper, Hein [23] presented in Equation (3.2) the following two ways to do so:
  • Define Q* as the distribution of the not-selected sample, i.e.,
    Q*[F] = P[F | Ω \ S] = (P[F] − P[F ∩ S]) / (1 − P[S]), F ∈ F.
    Then, it holds that
    P[A_i | H] = P[A_i ∩ S | H] + P[A_i ∩ (Ω \ S) | H] = Q[A_i | H] P[S | H] + Q*[A_i | H] (1 − P[S | H]), (31a)
    for i = 1, …, d.
  • Equation (30) can be written equivalently as
    Q[A_i | H] = P_i[S | H] P[A_i | H] / P[S | H].
    Hence, on the event {P_i[S | H] > 0}, the following representation of P[A_i | H], i = 1, …, d, is obtained:
    P[A_i | H] = (P[S | H] / P_i[S | H]) Q[A_i | H]. (31b)
Both (31a) and (31b) are of limited practical usefulness, however, as on the one hand, (31a) requires knowledge of the class labels in the not-selected sample, which usually are not available. On the other hand, for (31b) to be applicable, class-wise probabilities of selection P i [ S | H ] must be estimated, which again requires knowledge of the class labels in the not-selected sample.
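Continuing the simulation sketch above, (31b) recovers the population posteriors exactly once the class-wise selection probabilities P_i[S | H] are known; in the simulated model they coincide with φ(x, i), because the selection probability is a function of feature and class only:

```python
# Continuation of the simulation sketch above.
# P[S | H] per feature value, and recovery of P[A_i | H] via (31b).
P_S_given_H = (P_pop * phi).sum(axis=1, keepdims=True) / P_pop.sum(axis=1, keepdims=True)
post_P_recovered = (P_S_given_H / phi) * post_Q_theory     # formula (31b)
post_P_true = P_pop / P_pop.sum(axis=1, keepdims=True)
assert np.allclose(post_P_recovered, post_P_true)
```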

6.2. Sample Selection Bias and Factorizable Joint Shift

Proposition 6 provides an example of a condition for the sample selection process that makes the resulting bias between population and sample representable as covariate shift and, consequently, according to Section 5.2, as a special case of factorizable joint shift. Are there other selection procedures that entail factorizable joint shift?
We investigate this question by assuming that the population distribution P and the sample distribution Q are related by factorizable joint shift and then identifying the consequences this assumption implies for the class-wise feature-conditional selection probabilities P i [ S | H ] , i = 1 , , d .
Theorem 3.
Under Assumption 3 and Definition 4, let P and Q be related by factorizable joint shift in the sense of Definition 3, i.e., there are an H-measurable function g ≥ 0 and an A-measurable function b ≥ 0 such that the density h̄ of Q with respect to P on H̄ can be represented as h̄ = g b. Then, the following statements hold true:
(i) 
Q and P are related by factorizable joint shift with an H-measurable function g* > 0 and an A-measurable function b* > 0 that can be represented up to a constant factor in the sense of (8b) as
b* = ∑_{i=1}^{d−1} α_i (P[A_i]/Q[A_i]) 1_{A_i} + (P[A_d]/Q[A_d]) 1_{A_d}  and
g* = P[S] / ( P[S | H] ( ∑_{i=1}^{d−1} α_i (P[A_i]/Q[A_i]) Q[A_i | H] + (P[A_d]/Q[A_d]) Q[A_d | H] ) ), (32a)
where the constants 0 < α_1, …, α_{d−1} < ∞ satisfy the following equation system, with i = 1, …, d−1:
Q[A_i] = P[S] α_i E_Q[ Q[A_i | H] / ( P[S | H] ( ∑_{j=1}^{d−1} α_j (P[A_j]/Q[A_j]) Q[A_j | H] + (P[A_d]/Q[A_d]) Q[A_d | H] ) ) ]. (32b)
(ii) 
The population posterior probabilities P[A_i | H], i = 1, …, d, can be represented as functions of the sample posterior probabilities Q[A_i | H], i = 1, …, d, in the following way:
P[A_i | H] = α_i (P[A_i]/Q[A_i]) Q[A_i | H] / ( ∑_{j=1}^{d−1} α_j (P[A_j]/Q[A_j]) Q[A_j | H] + (P[A_d]/Q[A_d]) Q[A_d | H] ), i = 1, …, d−1,
P[A_d | H] = (P[A_d]/Q[A_d]) Q[A_d | H] / ( ∑_{j=1}^{d−1} α_j (P[A_j]/Q[A_j]) Q[A_j | H] + (P[A_d]/Q[A_d]) Q[A_d | H] ), (33)
where the constants 0 < α_1, …, α_{d−1} < ∞ satisfy equation system (32b).
(iii) 
The class-wise feature-conditional selection probabilities P_i[S | H], i = 1, …, d, can be represented as
P_i[S | H] = (Q[A_i] / (α_i P[A_i])) P[S | H] ( ∑_{j=1}^{d−1} α_j (P[A_j]/Q[A_j]) Q[A_j | H] + (P[A_d]/Q[A_d]) Q[A_d | H] ), (34)
where the constants 0 < α_1, …, α_{d−1} < ∞ satisfy equation system (32b) and α_d = 1.
Proof. 
Functions g and b must be positive since h ¯ is positive according to Proposition 5. Hence, Q and P are related by factorizable joint shift with decomposition b * = 1 / b and g * = 1 / g . Apply Theorem 2 with swapped roles of P and Q to obtain representation (32a) and equation system (32b). Statement (ii) follows immediately from Corollary 4.
Regarding (iii), use (28) and Proposition 1 (iv) together with (32a) to obtain
P_i[S] / P_i[S | H] = α_i g*, i = 1, …, d.
This is equivalent to (34). □
As mentioned in Section 4.1 as a potential application of Theorem 2, assuming that the posterior probabilities Q [ A i | H ] under the sample distribution can be estimated, Theorem 3 offers two obvious ways to learn the characteristics of factorizable joint shift:
(a)
If the population prior class probabilities P [ A i ] are known (for instance from external sources) solve (32b) for the constants α i .
(b)
If the population prior class probabilities P [ A i ] are unknown, fix values for the constants α i and solve (32b) for the P [ A i ] . Letting α i = 1 for all i is a natural choice that converts (32b) into the system of maximum likelihood equations for the P [ A i ] under the prior probability shift assumption.
In case (a), (34) may serve as an admissibility check for the solutions found. If the class-wise selection probabilities P i [ S | H ] obtained from (34) can take values greater than 100 % , the corresponding set of values ( α 1 , , α d 1 ) is not an admissible solution of (32b). If all solutions ( α 1 , , α d 1 ) of (32b) turn out to be inadmissible, it must be concluded that the assumption of factorizable joint shift for the sample selection process is wrong.
In case (b), from (34) it follows for all i, j = 1, …, d that
P_i[S | H] (P[A_i]/Q[A_i]) = P_j[S | H] (P[A_j]/Q[A_j]),
which implies
P_i[S | H] ≤ (Q[A_i]/P[A_i]) min(P[A_1]/Q[A_1], …, P[A_d]/Q[A_d]), for all i = 1, …, d. (35)
Inequality (35) provides a simple necessary criterion for the presence of factorizable joint shift with constants α i all equal to 1.
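Inequality (35) translates into a simple admissibility check. The helper below (hypothetical name, plain NumPy) flags estimated class-wise selection probabilities that are incompatible with factorizable joint shift with all constants α_i equal to 1:

```python
import numpy as np

def fjs_alpha_one_plausible(sel_class_given_H, p_prior, q_prior, tol=1e-12):
    """Necessary criterion (35) for factorizable joint shift with alpha_i = 1.

    sel_class_given_H : array (n, d), estimates of P_i[S | H] on a feature sample
    p_prior           : array (d,), population priors P[A_i]
    q_prior           : array (d,), sample priors Q[A_i]
    Returns False if any of the bounds in (35) is violated."""
    bound = (q_prior / p_prior) * np.min(p_prior / q_prior)
    return bool(np.all(sel_class_given_H <= bound + tol))
```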
A further, less obvious special case of Theorem 3 is encountered if it is assumed that
α_i (P[A_i]/Q[A_i]) = P[A_d]/Q[A_d], for all i = 1, …, d−1.
Then, (34) implies P i [ S | H ] = P [ S | H ] for all i = 1 , , d . By (31b), this means that population distribution and sample distribution are related by covariate shift, as already observed by Hein [23].

7. Conclusions

We revisited the notion of “factorizable joint shift” recently introduced by He et al. [3]. A main finding is that factorizable joint shift is actually not much more general than prior probability shift or covariate shift. However, in contrast to these two types of shifts, factorizable joint shift is not fully identifiable if no class label information on the test (target) dataset is available and no additional assumptions are made. These findings are based on a representation result (Theorem 2) and a comparison of the class posterior correction formula (12) for factorizable joint shift to the related correction formulae (16) and (17) for prior probability and covariate shifts, respectively. Formula (12) is structurally identical with formula (16) but includes additional constants which can be found by solving the nonlinear equation system (9c).
He et al. [3] did not present the full rationale for their joint importance aligning approach to estimating the characteristics of factorizable joint shift. Hence, solving equation system (9c) for the additional constants in the posterior correction formula or for the prior class probabilities under the target distribution can be considered attractive alternative approaches.
Some open research questions remain:
  • Under what conditions can the existence and the uniqueness of solutions ( ϱ 1 , , ϱ d 1 ) to equation system (9c) be guaranteed in the case of more than two classes?
  • Is there any manageable—in the sense of having observable characteristics—type of dataset shift which is both more complex than factorizable joint shift and less complex than covariate shift with posterior drift?
  • To which extent can Theorem 2 be adapted for a more general regression setting?

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The author thanks three anonymous reviewers for suggestions that helped to improve an earlier version of this paper.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. The Generalized Bayes Formula

Lemma A1 is Theorem 10.8 of Klebaner [25], slightly extended to explicitly cover the case when the denominator in the formula for the density can be 0.
Lemma A1.
Let $(\Omega, \mathcal{F})$ be a measurable space and $P$ and $Q$ probability measures on $(\Omega, \mathcal{F})$. Assume that $f = \frac{dQ}{dP}$ is a density of $Q$ with respect to $P$ on $\mathcal{F}$. Let $\mathcal{G}$ be a sub-σ-algebra of $\mathcal{F}$ and $X$ be a non-negative random variable on $(\Omega, \mathcal{F})$ or a random variable on $(\Omega, \mathcal{F})$ such that $f X$ is $P$-integrable. Then, the following two statements hold:
(i) 
$\{f > 0\} \subseteq \{E_P[f\,|\,\mathcal{G}] > 0\}$, in the sense of $P\big[f > 0,\ E_P[f\,|\,\mathcal{G}] = 0\big] = 0$.
(ii) 
$E_Q[X\,|\,\mathcal{G}] = \dfrac{E_P[f X\,|\,\mathcal{G}]}{E_P[f\,|\,\mathcal{G}]}\, \mathbf{1}_{\{E_P[f\,|\,\mathcal{G}] > 0\}}.$
Proof. 
For (i): Observe that
$E_P\big[f\, \mathbf{1}_{\{E_P[f\,|\,\mathcal{G}] = 0\}}\big] = E_P\big[E_P[f\,|\,\mathcal{G}]\, \mathbf{1}_{\{E_P[f\,|\,\mathcal{G}] = 0\}}\big] = 0.$
This implies
$0 = P\big[f\, \mathbf{1}_{\{E_P[f\,|\,\mathcal{G}] = 0\}} > 0\big] = P\big[f > 0,\ E_P[f\,|\,\mathcal{G}] = 0\big].$
For (ii): see Klebaner [25], proof of Theorem 10.8. □
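A quick numerical sanity check of statement (ii) is possible on a finite sample space, where conditional expectations reduce to block-wise averages. The following self-contained Python snippet uses made-up numbers for $P$, the density $f$ and the random variable $X$, with $\mathcal{G}$ generated by a two-block partition.

import numpy as np

# Finite sample space {0,...,5}; P uniform; f = dQ/dP with E_P[f] = 1.
p = np.full(6, 1 / 6)
f = np.array([0.0, 0.6, 1.2, 1.8, 1.2, 1.2])
q = p * f                                      # the measure Q
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # the random variable X
blocks = [np.array([0, 1, 2]), np.array([3, 4, 5])]  # partition generating G

for b in blocks:
    ep_f = np.sum(p[b] * f[b]) / np.sum(p[b])          # E_P[f | G] on the block
    ep_fx = np.sum(p[b] * f[b] * x[b]) / np.sum(p[b])  # E_P[f X | G] on the block
    lhs = np.sum(q[b] * x[b]) / np.sum(q[b])           # E_Q[X | G] on the block
    rhs = ep_fx / ep_f if ep_f > 0 else 0.0            # right-hand side of (ii)
    print(lhs, rhs)                                    # the two values agree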

Appendix B. Proofs

Appendix B.1. Proof of Proposition 2

For a more concise notation, define the non-negative, $\mathcal{H}$-measurable random variables $R_1$ and $R_2$ by
$R_1 = \dfrac{P[A\,|\,\mathcal{H}]}{p} \quad \text{and} \quad R_2 = \dfrac{1 - P[A\,|\,\mathcal{H}]}{1 - p}.$
Then, (9c) can be written as
$1 = \varrho\, E_P\!\left[ \dfrac{h\, R_1}{\varrho\, q\, R_1 + (1-q)\, R_2} \right]. \qquad \text{(A1a)}$
Some algebra shows that (A1a) is equivalent to
$1 = E_P\!\left[ \dfrac{h\, R_2}{\varrho\, q\, R_1 + (1-q)\, R_2} \right], \qquad \text{(A1b)}$
and that it is also equivalent to
$0 = E_P\!\left[ \dfrac{h\, (\varrho\, R_1 - R_2)}{\varrho\, q\, R_1 + (1-q)\, R_2} \right]. \qquad \text{(A1c)}$
Define the function $g(\varrho) = E_P\!\left[ \dfrac{h\, R_2}{\varrho\, q\, R_1 + (1-q)\, R_2} \right]$ for $\varrho \ge 0$. Then, it holds that
  • $g(\varrho) \le \dfrac{1}{1-q} < \infty$ for all $\varrho \ge 0$;
  • $g(0) = \dfrac{1}{1-q} > 1$;
  • by the dominated convergence theorem, $g$ is continuous on $0 \le \varrho < \infty$ with $\lim_{\varrho \to \infty} g(\varrho) = 0$.
By the intermediate value theorem, these properties of $g$ imply the existence of some $\varrho > 0$ with $g(\varrho) = 1$. By the equivalence of (A1a) and (A1b), the existence of a positive solution $\varrho$ to (9c) follows.
Regarding the uniqueness of the solution to (9c), define for $q \in (0,1)$ and $\varrho \in (0,\infty)$ the function $f(q, \varrho) = E_P\!\left[ \dfrac{h\, (\varrho\, R_1 - R_2)}{\varrho\, q\, R_1 + (1-q)\, R_2} \right]$. Then, $f$ is continuously partially differentiable with
$\dfrac{\partial f}{\partial q}(q, \varrho) = -E_P\!\left[ \dfrac{h\, (\varrho\, R_1 - R_2)^2}{(\varrho\, q\, R_1 + (1-q)\, R_2)^2} \right] \quad \text{and} \quad \dfrac{\partial f}{\partial \varrho}(q, \varrho) = E_P\!\left[ \dfrac{h\, R_1\, R_2}{(\varrho\, q\, R_1 + (1-q)\, R_2)^2} \right].$
The assumption $P\big[ P[A\,|\,\mathcal{H}] \in \{0, 1\} \big] = 0$ implies $\dfrac{\partial f}{\partial \varrho}(q, \varrho) > 0$ for all $0 < q < 1$ and $\varrho > 0$.
$P[\varrho\, R_1 - R_2 = 0] = 1$ would imply that $P[A\,|\,\mathcal{H}]$ is almost surely constant and hence equal to $p$; as a further consequence, $A$ and $\mathcal{H}$ would be independent. By assumption, this is not the case, and hence $P[\varrho\, R_1 - R_2 = 0] < 1$. This also implies $\dfrac{\partial f}{\partial q}(q, \varrho) < 0$ for all $0 < q < 1$ and $\varrho > 0$.
Consequently, by the implicit function theorem, there exists a continuously differentiable function $\phi: (0,1) \to (0,\infty)$, $q \mapsto \phi(q) = \varrho$, such that $f(q, \phi(q)) = 0$ for all $0 < q < 1$ and
$\phi'(q) = \dfrac{E_P\!\left[ \dfrac{h\, (\phi(q)\, R_1 - R_2)^2}{(\phi(q)\, q\, R_1 + (1-q)\, R_2)^2} \right]}{E_P\!\left[ \dfrac{h\, R_1\, R_2}{(\phi(q)\, q\, R_1 + (1-q)\, R_2)^2} \right]} > 0.$
This proves claim (i) on $\phi$ as well as the existence of $\lim_{q \to 0} \phi(q) < \infty$ and $\lim_{q \to 1} \phi(q) > 0$. Making use again of the equivalence of (A1c) and (A1a) and invoking Lemma 4.1 of Tasche [15] now implies, for all $0 < q < 1$,
$\dfrac{1}{E_P\!\left[ h\, \dfrac{R_1}{R_2} \right]} < \phi(q) < E_P\!\left[ h\, \dfrac{R_2}{R_1} \right]. \qquad \text{(A2)}$
These inequalities also hold true if $E_P\!\left[ h\, \frac{R_1}{R_2} \right] = \infty$ or $E_P\!\left[ h\, \frac{R_2}{R_1} \right] = \infty$.
Now, apply Fatou’s lemma to obtain
$1 = \liminf_{q \to 1} E_P\!\left[ \dfrac{h\, R_2}{\phi(q)\, q\, R_1 + (1-q)\, R_2} \right] \ge E_P\!\left[ h\, \dfrac{R_2}{R_1} \right] \dfrac{1}{\lim_{q \to 1} \phi(q)}.$
From this, $\lim_{q \to 1} \phi(q) \ge E_P\!\left[ h\, \frac{R_2}{R_1} \right]$ follows, and by (A2), also claim (iii).
Another application of Fatou’s lemma gives
$1 = \liminf_{q \to 0} \phi(q)\, E_P\!\left[ \dfrac{h\, R_1}{\phi(q)\, q\, R_1 + (1-q)\, R_2} \right] \ge E_P\!\left[ h\, \dfrac{R_1}{R_2} \right] \lim_{q \to 0} \phi(q).$
Together with (A2), this proves claim (ii) and completes the proof.
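The monotonicity and limit behaviour of $g$ established in this proof also suggest a simple numerical scheme for the binary case: approximate the expectation in (A1b) by a sample average and bracket the root of $g(\varrho) = 1$. The Python sketch below assumes that estimates of the feature density ratio $h$ and of the source posterior probabilities are available at a sample of source feature observations; these inputs and the function name are assumptions made for the illustration.

import numpy as np
from scipy.optimize import brentq

def solve_rho(h, post, p, q):
    """Sample-based sketch for the root rho of g(rho) = 1, cf. (A1a) and (A1b).

    h    : (n,) estimated density dQ/dP of the target feature distribution with
           respect to the source feature distribution, at source feature samples
    post : (n,) source posterior probabilities P[A | H] at the same samples,
           assumed to lie strictly between 0 and 1
    p, q : prior probability of class A under the source and target distribution
    """
    r1 = post / p
    r2 = (1.0 - post) / (1.0 - p)

    def g(rho):
        # Sample average approximating E_P[h R_2 / (rho q R_1 + (1 - q) R_2)]
        return np.mean(h * r2 / (rho * q * r1 + (1.0 - q) * r2))

    # g is decreasing with g(0) > 1 and g(rho) -> 0 as rho grows, so the root
    # can be bracketed by doubling the upper end point of the search interval.
    upper = 1.0
    while g(upper) > 1.0:
        upper *= 2.0
    return brentq(lambda rho: g(rho) - 1.0, 0.0, upper)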

Appendix B.2. Proof of Proposition 4

Observe that the function g in (23) is well-defined because the denominator
$\gamma = \sum_{i=1}^d \dfrac{Q[A_i]}{P[A_i]}\, P[A_i\,|\,\mathcal{G}] = \sum_{i=1}^d \dfrac{Q[A_i]}{P[A_i]}\, P[A_i\,|\,\mathcal{H}]$
on the right-hand side of the equation is always positive. On the one hand, by Corollary 2, we obtain for $i = 1, \ldots, d$ on the set $\{h > 0\}$
$Q[A_i\,|\,\mathcal{H}] = \dfrac{h_i}{h}\, \dfrac{Q[A_i]}{P[A_i]}\, P[A_i\,|\,\mathcal{H}], \qquad \text{(A4a)}$
where $h_i$ denotes the density of the target class-conditional feature distribution $Q_i$ with respect to the source class-conditional feature distribution $P_i$ on $\mathcal{H}$.
On the other hand, by combining the prior probability shift property (22) on $\mathcal{G} = \sigma(T)$, the sufficiency property (21b) and (14b), Corollary 2 implies
$Q[A_i\,|\,\mathcal{H}] = \dfrac{Q[A_i]}{P[A_i]}\, \dfrac{P[A_i\,|\,\mathcal{H}]}{\gamma}. \qquad \text{(A4b)}$
Hence, from (A4a) and (A4b), it follows for $i = 1, \ldots, d$ that
$\dfrac{P[A_i\,|\,\mathcal{H}]}{\gamma} = \dfrac{h_i\, P[A_i\,|\,\mathcal{H}]}{h} \quad \text{on } \{h > 0\}. \qquad \text{(A4c)}$
Making use of (A4c), we obtain for any $F = \bigcup_{i=1}^d (A_i \cap H_i) \in \overline{\mathcal{H}}$
$\begin{aligned}
Q[F] &= E_Q\big[ Q[F\,|\,\mathcal{H}] \big] = \sum_{i=1}^d E_P\big[ h\, \mathbf{1}_{H_i}\, Q[A_i\,|\,\mathcal{H}] \big] \\
&= \sum_{i=1}^d E_P\!\left[ h\, \mathbf{1}_{\{h > 0\}}\, \mathbf{1}_{H_i}\, \dfrac{h_i}{h}\, \dfrac{Q[A_i]}{P[A_i]}\, P[A_i\,|\,\mathcal{H}] \right] = \sum_{i=1}^d E_P\!\left[ h\, \mathbf{1}_{H_i}\, \dfrac{Q[A_i]}{P[A_i]}\, \dfrac{P[A_i\,|\,\mathcal{H}]}{\gamma} \right] \\
&= E_P\!\left[ \dfrac{h}{\gamma} \sum_{i=1}^d \dfrac{Q[A_i]}{P[A_i]}\, \mathbf{1}_{H_i}\, P[A_i\,|\,\mathcal{H}] \right] = E_P\!\left[ \dfrac{h}{\gamma} \sum_{i=1}^d \dfrac{Q[A_i]}{P[A_i]}\, \mathbf{1}_{A_i \cap H_i} \right] \\
&= E_P\!\left[ \mathbf{1}_F\, \dfrac{h}{\gamma} \sum_{i=1}^d \dfrac{Q[A_i]}{P[A_i]}\, \mathbf{1}_{A_i} \right].
\end{aligned}$
This proves that $b\, g$, with $b$ and $g$ as defined by (23), is a density of $Q$ with respect to $P$ on $\overline{\mathcal{H}}$. As the $\mathcal{A}$-measurability of $b$ and the $\mathcal{H}$-measurability of $g$ are obvious, the proof is complete.
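As an aside, the correction (A4b) is straightforward to apply once estimates of the source posterior probabilities and of the prior probabilities under both distributions are available. A minimal Python sketch follows; the function name and array layout are assumptions made for the illustration.

import numpy as np

def prior_corrected_posteriors(source_post, source_priors, target_priors):
    """Posterior correction in the spirit of (A4b): rescale the source posteriors
    P[A_i | H] by the prior ratios Q[A_i] / P[A_i] and renormalize row-wise.

    source_post   : (n, d) source posterior probabilities P[A_i | H]
    source_priors : (d,) source prior probabilities P[A_i]
    target_priors : (d,) target prior probabilities Q[A_i]
    """
    rescaled = source_post * (target_priors / source_priors)
    gamma = rescaled.sum(axis=1, keepdims=True)  # row-wise normalizer gamma
    return rescaled / gamma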

References

  1. Storkey, A. When Training and Test Sets Are Different: Characterizing Learning Transfer. In Dataset Shift in Machine Learning; Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N., Eds.; The MIT Press: Cambridge, MA, USA, 2009; Chapter 1; pp. 3–28.
  2. Moreno-Torres, J.; Raeder, T.; Alaiz-Rodriguez, R.; Chawla, N.; Herrera, F. A unifying view on dataset shift in classification. Pattern Recognit. 2012, 45, 521–530.
  3. He, H.; Yang, Y.; Wang, H. Domain Adaptation with Factorizable Joint Shift. arXiv 2021, arXiv:2203.02902.
  4. Saerens, M.; Latinne, P.; Decaestecker, C. Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Comput. 2001, 14, 21–41.
  5. Elkan, C. The Foundations of Cost-Sensitive Learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, IJCAI 2001, Seattle, WA, USA, 4–10 August 2001; Nebel, B., Ed.; Morgan Kaufmann: San Francisco, CA, USA, 2001; pp. 973–978.
  6. Billingsley, P. Probability and Measure, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 1986.
  7. Klenke, A. Probability Theory: A Comprehensive Course; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
  8. Tasche, D. Class Prior Estimation under Covariate Shift: No Problem? arXiv 2022, arXiv:2206.02449.
  9. Holzmann, H.; Eulert, M. The role of the information set for forecasting—With applications to risk management. Ann. Appl. Stat. 2014, 8, 595–621.
  10. Johansson, F.; Sontag, D.; Ranganath, R. Support and Invertibility in Domain-Invariant Representations. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Okinawa, Japan, 16–18 April 2019; Chaudhuri, K., Sugiyama, M., Eds.; Volume 89, pp. 527–536.
  11. Scott, C. A Generalized Neyman-Pearson Criterion for Optimal Domain Adaptation. In Proceedings of the Machine Learning Research, 30th International Conference on Algorithmic Learning Theory, Chicago, IL, USA, 22–24 March 2019; Volume 98, pp. 1–24.
  12. González, P.; Castaño, A.; Chawla, N.; Coz, J.D. A Review on Quantification Learning. ACM Comput. Surv. 2017, 50, 74:1–74:40.
  13. Tasche, D. The art of probability-of-default curve calibration. J. Credit. Risk 2013, 9, 63–103.
  14. Du Plessis, M.; Sugiyama, M. Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Netw. 2014, 50, 110–119.
  15. Tasche, D. The Law of Total Odds. arXiv 2013, arXiv:1312.0365.
  16. Zhang, K.; Schölkopf, B.; Muandet, K.; Wang, Z. Domain Adaptation Under Target and Conditional Shift. In Proceedings of the 30th International Conference on International Conference on Machine Learning—Volume 28, ICML’13, Atlanta, GA, USA, 17–19 June 2013; pp. III-819–III-827.
  17. Hofer, V.; Krempl, G. Drift mining in data: A framework for addressing drift in classification. Comput. Stat. Data Anal. 2013, 57, 377–391.
  18. Lipton, Z.; Wang, Y.X.; Smola, A. Detecting and Correcting for Label Shift with Black Box Predictors. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; Volume 80, pp. 3122–3130.
  19. Tasche, D. Calibrating sufficiently. Statistics 2021, 55, 1356–1386.
  20. Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 1996.
  21. Tachet des Combes, R.; Zhao, H.; Wang, Y.X.; Gordon, G. Domain Adaptation with Conditional Distribution Matching and Generalized Label Shift. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 19276–19289.
  22. Tasche, D. Fisher Consistency for Prior Probability Shift. J. Mach. Learn. Res. 2017, 18, 1–32.
  23. Hein, M. Binary Classification under Sample Selection Bias. In Dataset Shift in Machine Learning; Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N., Eds.; The MIT Press: Cambridge, MA, USA, 2009; Chapter 3; pp. 41–64.
  24. Zadrozny, B. Learning and Evaluating Classifiers under Sample Selection Bias. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML’04, Banff, AB, Canada, 4–8 July 2004; Association for Computing Machinery: New York, NY, USA, 2004.
  25. Klebaner, F. Introduction to Stochastic Calculus with Applications, 2nd ed.; Imperial College Press: London, UK, 2005.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
