Article
Peer-Review Record

Approximately Optimal Domain Adaptation with Fisher’s Linear Discriminant

Mathematics 2024, 12(5), 746; https://doi.org/10.3390/math12050746
by Hayden Helm 1,*, Ashwin de Silva 2, Joshua T. Vogelstein 2, Carey E. Priebe 3 and Weiwei Yang 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4: Anonymous
Submission received: 10 January 2024 / Revised: 20 February 2024 / Accepted: 22 February 2024 / Published: 1 March 2024
(This article belongs to the Special Issue Statistical Analysis: Theory, Methods and Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The reviewer would like to thank the authors for submitting their work. A series of high-level comments on the presented manuscript is given below.

Introduction: Add a paragraph with high-level detail on how the method works. How is it different from others, what is the high-level idea, what are the challenges, and why is it a good idea?
line 39: The notion of optimality would need to be defined.
line 100: Why is the task to only minimize the expected loss with respect to P^0; what about the other distributions?
line 134: Can the authors give more insights for this constraint?
line 139: Can the authors give a derivation or reference of eq.(3)?
Algorithm 1: What do the authors mean by the variance of the vector mu-hat in line 3? What is the underlying distribution? Also what do the authors mean by "For each alpha in [0,1]" for a continuous interval [0,1]? -- do they perhaps mean for all alpha on a grid which is a subset of [0,1]? What is the grid?
line 185: For the real dataset experiments, it would be good to state why the specific datasets have been selected, aka what qualifies them as good benchmarks? Are they used commonly in the literature for those tasks? Please cite other publications using them.
line 198: Why would 1000 iterations imply the standard error of each sample is zero? This entirely depends on the distribution you sample from. Can the authors provide the distributions?
line 232: This is a very strong claim. Can the authors justify this based on one simulation study alone? Such a statement would make sense with a proof, or with a much more exhaustive simulation study.
line 284: force the class conditional class
line 358: a maixmum improvement
line 363: This is not how p-values are stated. Better say <10e-10 or something. P-values should never be zero.

Author Response

Thank you for taking the time to review our manuscript. Your comments, questions, and suggestions helped us both improve the draft and improve our understanding of our contribution. 

We have responded to each of your comments, questions, and suggestions in line below:

Introduction: Add a paragraph with high-level detail on how the method works. How is it different from others, what is the high-level idea, what are the challenges, and why is it a good idea?

We have added a high level description of how the method works and how it is different from others to the introduction.

line 39: The notion of optimality would need to be defined.

Thank you for pointing this out – we have qualified our original statement with “under 0-1 classification loss”.

line 100: Why is the task to only minimize the expected loss with respect to P^0; what about the other distributions?

In the “classical” transfer learning and domain adaptation settings, you have access to data (or artifacts thereof) from source tasks and data from a target task, and the goal is to construct a classifier that performs as well as possible on test data from the target distribution P^0.

For other settings, such as multi-task learning or continual learning, classifiers are constructed to perform well on some aggregate metric that summarizes performance across all previously seen tasks P^0, ..., P^J.

Our work considers the transfer learning setting so we only minimize the expected loss with respect to P^0. We’ve added a sentence to Section 1.2 to clarify this.

line 134: Can the authors give more insights for this constraint?

Yes. In fact, the constraint, as previously stated, was incorrect and was an artifact of a previous parameterization of the problem. 

The constraint should be that the L2-norm of the product of the inverse covariance matrix and the class-conditional mean is 1. We have fixed this in the draft and have added a sentence to help the reader more easily see this constraint.
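For concreteness, a minimal sketch of such a constraint in NumPy, assuming the standard Fisher's linear discriminant parameterization with the class-mean difference (the function and variable names here are illustrative, not the paper's):

```python
import numpy as np

def fld_projection(X0, X1):
    """Fisher's linear discriminant projection vector for two classes,
    normalized so that || Sigma^{-1} (mu1 - mu0) ||_2 = 1."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance estimate.
    Sigma = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2
    w = np.linalg.solve(Sigma, mu1 - mu0)  # Sigma^{-1} (mu1 - mu0)
    return w / np.linalg.norm(w)           # enforce the unit L2-norm constraint

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(100, 5))  # class 0 samples
X1 = rng.normal(0.5, 1.0, size=(100, 5))  # class 1 samples
w_hat = fld_projection(X0, X1)
assert abs(np.linalg.norm(w_hat) - 1.0) < 1e-12
```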

line 139: Can the authors give a derivation or reference of eq.(3)?

We have added more detailed steps of the derivation of the RHS of eq.(3).

Algorithm 1: What do the authors mean by the variance of the vector mu-hat in line 3? What is the underlying distribution? Also what do the authors mean by "For each alpha in [0,1]" for a continuous interval [0,1]? -- do they perhaps mean for all alpha on a grid which is a subset of [0,1]? What is the grid?

Our proposed method combines two vectors – the projection vector “learned” from the small amount of data from P^0 and the average of the source projection vectors. With the assumption that the source projection vectors are from a von Mises-Fisher distribution, the average of the source projection vectors is approximately normally distributed with covariance proportional to the identity matrix. When we say “variance” of mu-hat in line 3 of Algorithm 1, we are referring to the covariance matrix of the average source vector via this normal approximation. To emphasize this, we have changed “variance” to “approx-cov”.

Yes, you are right, we mean a grid that is a subset of [0,1]. We have added a grid size parameter h to Algorithm 1 and have updated the for loop accordingly. In our experiments we use the grid {0, 0.1, 0.2, ..., 1}, which was already described in the first version of the draft.
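A minimal sketch of that grid search, as an illustration of the idea rather than the paper's Algorithm 1; `estimate_risk` is a hypothetical stand-in for the expected-risk approximation:

```python
import numpy as np

def best_convex_combination(w_target, w_source_avg, estimate_risk, h=0.1):
    """Search the grid {0, h, 2h, ..., 1} for the convex combination
    coefficient alpha that minimizes an estimate of the expected risk."""
    best_alpha, best_risk = None, np.inf
    for alpha in np.arange(0.0, 1.0 + 1e-9, h):
        w = alpha * w_target + (1.0 - alpha) * w_source_avg
        w = w / np.linalg.norm(w)  # projection vectors live on the unit sphere
        risk = estimate_risk(w)
        if risk < best_risk:
            best_alpha, best_risk = alpha, risk
    return best_alpha, best_risk
```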

line 185: For the real dataset experiments, it would be good to state why the specific datasets have been selected, aka what qualifies them as good benchmarks? Are they used commonly in the literature for those tasks? Please cite other publications using them.

We have added a few paragraphs at the beginning of the real data experiments describing why each of the benchmarks was selected and citing some other publications that use them (if applicable).

line 198: Why would 1000 iterations imply the standard error of each sample is zero? This entirely depends on the distribution you sample from. Can the authors provide the distributions?

The accuracy and the optimal convex combination coefficient are both bounded between 0 and 1, so the variance of either is bounded above by ¼. Hence, the standard error (square root of the variance divided by the square root of the number of iterations) of the reported average value is bounded above by 0.016. 
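For a quantity A supported on [0, 1] and n = 1000 iterations, this is Popoviciu's inequality applied to the standard error of the mean:

```latex
\operatorname{Var}(A) \le \frac{(1 - 0)^2}{4} = \frac{1}{4}
\quad\Longrightarrow\quad
\operatorname{SE}\!\left(\bar{A}\right)
= \sqrt{\frac{\operatorname{Var}(A)}{n}}
\le \sqrt{\frac{1/4}{1000}} \approx 0.0158.
```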

We provide more details related to the distributions of the accuracy and optimal convex combination coefficient in the real data examples.

line 232: This is a very strong claim. Can the authors justify this based on one simulation study alone? Such a statement would make sense with a proof, or with a much more exhaustive simulation study.

We agree that asserting that the approximation is generally appropriate is a very strong claim. We have added a qualifier to temper it and to give the reader an understanding of what additional simulations should be considered to better understand the appropriateness of the approximation.

line 284: force the class conditional class

Addressed, thank you!

line 358: a maixmum improvement

Addressed, thank you!

line 363: This is not how p-values are stated. Better say <10e-10 or something. P-values should never be zero.

Thank you for pointing this out. We now report p-values that are less than 0.001 more appropriately.

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

I enjoyed reading your manuscript and I have some comments for you to improve the quality of the manuscript.

1. The text provides a detailed explanation of distributional assumptions and the formulation of the binary classification distribution (P(j)). The clarity aids understanding, but a brief clarification on the practical implications or real-world examples could enhance accessibility for non-experts.
2. The adoption of the von Mises-Fisher (vMF) distribution for optimal projection vectors is introduced, but the rationale behind choosing this distribution is not explicitly discussed. Could you elaborate on why the vMF distribution is suitable for this context and what benefits it offers over alternative distributions?
3. While the technical details are well-presented, a connection to real-world scenarios or applications could strengthen the text's relevance. How do these distributional assumptions and classifier parameters translate into practical machine learning tasks or domains? Can you provide an example where this framework might be applied and yield meaningful results?
Comments on the Quality of English Language

Overall, the quality of the English language is fine, and the text effectively communicates complex concepts.

Author Response

Thank you for taking the time to review our manuscript. Your comments, questions, and suggestions helped us both improve the draft and improve our understanding of our contribution. 

We have responded to each of your comments, questions, and suggestions in line below:

  1. The text provides a detailed explanation of distributional assumptions and the formulation of the binary classification distribution (P(j)). The clarity aids understanding, but a brief clarification on the practical implications or real-world examples could enhance accessibility for non-experts.

Thank you for your suggestion. We have added a few sentences on practical implications in the first and second paragraphs before describing the real data results. 

In summary, for the physiological prediction problem (which is our main motivator), a person wears a device that collects, processes, and classifies a segment of recorded data.

For many modalities, such as EEG and ECG, the distribution of the data that is going to be classified is highly variable across persons, recording sessions, tasks, etc., and previously learned classifiers do not work out of the box. This means that for every session, the device must record “calibration” data and either train a new classifier or update existing classifiers.

The method proposed and studied in the draft is an “approximately optimal” method that combines the previously learned classifiers with new data that is highly relevant to predictions for the current session. In particular, we can greatly reduce the calibration time for EEG-based cognitive load, EEG-based stress, and ECG-based stress prediction by using our proposed method.

  2. The adoption of the von Mises-Fisher (vMF) distribution for optimal projection vectors is introduced, but the rationale behind choosing this distribution is not explicitly discussed. Could you elaborate on why the vMF distribution is suitable for this context and what benefits it offers over alternative distributions?

Great question. 

Our choice of modeling the projection vectors as realizations of a vMF distribution was motivated by our observation that the majority of baselines in the physiological prediction problem are still polynomial (or even linear) classifiers and by our desire to make the relationship between different tasks explicit. To these ends, we realized that linear classifiers can be thought of as a (projection vector, threshold) pair and that it is “natural” to assume that the projection vectors of related classification problems are unitary rotations of each other.

The vMF distribution is a generative distribution for exactly these situations -- with the additional constraint that realizations are on the unit sphere -- and is notably flexible for modeling things on the unit sphere (with limiting parameter settings creating a uniform distribution on the unit sphere on one end and a Gaussian-then-point-mass on the other end). 

The vMF distribution is also particularly well suited for modeling projection vectors learned from Gaussian data. In particular, the crux of our method depends on the limiting distribution of a vector that is a convex combination of a projection vector from a two-class Gaussian classification problem and an average source vector. The projection vector from the two-class Gaussian classification problem is itself Gaussian. The normalized average source vector is Gaussian in the limit (i.e., with enough source tasks). Together, the convex combination is Gaussian in the limit, which allows us to propose the computable approximation to the expected risk.
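A minimal sketch of this modeling assumption, assuming SciPy ≥ 1.11 (which provides scipy.stats.vonmises_fisher); the dimension, number of source tasks, and concentration value below are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import vonmises_fisher

rng = np.random.default_rng(0)
d, J, kappa = 5, 50, 20.0      # dimension, number of source tasks, concentration
mu = np.zeros(d)
mu[0] = 1.0                    # mean direction on the unit sphere

# Source projection vectors modeled as vMF draws around a common direction.
W = vonmises_fisher(mu, kappa).rvs(J, random_state=rng)  # shape (J, d)

# The normalized average source vector concentrates around mu as J grows,
# and is approximately Gaussian around it -- the approximation used above.
w_bar = W.mean(axis=0)
w_bar /= np.linalg.norm(w_bar)
print(np.dot(w_bar, mu))       # close to 1 for moderate kappa and J
```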

 

  3. While the technical details are well-presented, a connection to real-world scenarios or applications could strengthen the text's relevance. How do these distributional assumptions and classifier parameters translate into practical machine learning tasks or domains? Can you provide an example where this framework might be applied and yield meaningful results?

The distributional assumptions that we discuss while deriving an approximation for the expected risk for a classifier in our proposed class of classifiers are a means to an end. In particular, by providing a computationally friendly way to find an “optimal” classifier, we are able to avoid costly searches over the space of possible classifiers while having confidence that the classifier is competitive.

The distributional assumptions will not hold in practice. Instead, as we suggest in the draft, practitioners can use the little in-session calibration data that is available to transform their data to conform with the assumptions. We do this, with great success, in the real data examples that we discuss in the text.

We expect our method -- and methods like it -- to be highly relevant for improving the usability (by decreasing the required calibration time) of human-computer interfaces in general.

Reviewer 3 Report

Comments and Suggestions for Authors

General Comments: The subject addressed in this article, “Approximately Optimal Domain Adaptation with Fisher’s Linear Discriminant”, is worthy of investigation. The authors propose a method to deal with the problem of interpolation between the classical and modern approaches for a specific set of classifiers.

 

Strengths:

• A good review of the state of the art was done.

• The description of the method and the mathematical formulation were well done and clear for readers.

 

Weaknesses:

• The authors must add information about the computational cost and processing time.

Comments for author File: Comments.pdf

Author Response

Thank you for taking the time to review our manuscript. Your comments, questions, and suggestions helped us both improve the draft and improve our understanding of our contribution. 

We have responded to each of the described weaknesses below:

The authors must add information about the computational cost and processing time.

Thank you for your suggestion. We have added a short section (4.6) in the draft to comment on the computational cost of calculating the three projection vectors under study.

Reviewer 4 Report

Comments and Suggestions for Authors

The paper proposes a classifier that can leverage information from both the target and source distributions. The proposed classifier is the convex combination of an average of the source task classifiers and a classifier trained on the limited data available for the target task. The results are validated with simulated data and then with real physiological prediction settings. This is a nicely written article.

I have the following minor concerns:

  1. Figure 1 comes too early in the paper. I would bring it closer to the page where it is mentioned in the text.
  2. Similar comment to the one above. Figures 4 and 5 are far away from the associated text and it is difficult to read and look at the results.
  3. Is there a particular reason why the average source classifier was almost always the worst? Would improving the accuracy of the average source classifier change the results?
  4. In Figures 4–6, is there a way to overlay the variance onto the accuracy comparisons (top-left figures)?
  5. In the reference list, some journal names are capitalized and some are not (e.g., IEEE transactions on pattern analysis and machine intelligence vs. Information Fusion). I think it is better if they are written consistently.
  6. The journal names should be typed consistently (e.g., Journal of Machine Learning Research vs. J. Mach. Learn. Res.).

Author Response

Thank you for taking the time to review our manuscript. Your comments, questions, and suggestions helped us both improve the draft and improve our understanding of our contribution. 

We have responded to each of your comments, questions, and suggestions in line below:

1. Figure 1 comes too early in the paper. I would bring it closer to the page where it is mentioned in the text.

We have moved Figure 1 closer to the associated text.

2. Similar comment to the one above. Figures 4 and 5 are far away from the associated text and it is difficult to read and look at the results.

We have tried to move Figures 4 and 5 closer to the associated text.

3. Is there a particular reason why the average source classifier was almost always the worst? Would improving the accuracy of the average source classifier change the results?

Yes. Using the average source classifier is a relatively naive classification strategy -- it assumes that the true-but-unknown projection vector for the target task is exactly the average of the source tasks. For the settings that we study, and for physiological prediction in general, the projection vectors have high variance (see Figure 7), and it is thus unlikely that any of the individual projection vectors is the same as the average. With that said, there do seem to be settings, such as EEG-based cognitive load classification, where the average source classifier is better than the target classifier in extremely low data regimes.

Improving the accuracy of the average source classifier would likely not change the general qualitative observation that “optimal combination > average source” and “optimal combination > target only” across most data regimes, though it would change the absolute accuracy of the optimal combination.

4. In Figures 4–6, is there a way to overlay the variance onto the accuracy comparisons (top-left figures)?

Thank you for your suggestion. We think that overlaying the variance onto the accuracy comparisons would make the figures too busy.

Instead, we compare the average/median performance of the two most interesting classifiers (optimal and target) more directly via the paired histograms.

5. In the reference list, some journal names are capitalized and some are not (e.g., IEEE transactions on pattern analysis and machine intelligence vs. Information Fusion). I think it is better if they are written consistently.

We have tried to capitalize journal and conference names properly.

6. The journal names should be typed consistently (e.g., Journal of Machine Learning Research vs. J. Mach. Learn. Res.).

We have tried to address this.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you, the authors have satisfactorily addressed almost all comments. Upon reading the first revision, I have a few more remarks on the manuscript:

line 34: "For other methods developed in the transfer learning or domain adaptation settings, the expected risk is typically unavailable and computationally expensive heuristics need to be used to find the optimal available classifier."
-- Is this relevant for the paper?

line 104: "however, there is often not enough data from the target task to adequately train classifiers and we assume, instead, that there is auxillary data (or derivatives thereof) from different contexts available that can be used to improve the expected loss."
-- Would it be possible to give a reference for this approach? It seems to be widely adopted in the community, as this scenario seems quite commonplace.

line 110: "Note that for"
-- This might have been the explanation of the transfer learning setting in which the authors only minimize the expected loss with respect to P^0. However, this sentence is incomplete.

Algorithm 1: The authors have given a detailed description on the computation of the covariance of mu-hat and w-hat in the response. It seems to me that this is not described in the text. I would suggest adding their explanation ("With the assumption that the source projection vectors are from a von Mises-Fisher distribution, the average of the source projection vectors is approximately normally distributed with covariance proportional to the identity matrix. When we say 'variance' of mu-hat in line 3 of Algorithm 1, we are referring to the covariance matrix of the average source vector via this normal approximation.") also in the text describing Algorithm 1.

Author Response

Thank you, again, for your time and thoughts.

We have attempted to respond to your comments below.

 

line 34: "For other methods developed in the transfer learning or domain adaptation settings, the expected risk is typically unavailable and computationally expensive heuristics need to be used to find the optimal available classifier."
-- Is this relevant for the paper?

We originally thought that it was relevant, but have decided to remove it as we believe we make this point in the related works section.

line 104: "however, there is often not enough data from the target task to adequately train classifiers and we assume, instead, that there is auxillary data (or derivatives thereof) from different contexts available that can be used to improve the expected loss."
-- Would it be possible to give a reference for this approach? It seems to be widely adopted in the community, as this scenario seems quite commonplace.

We have added a reference to a standard survey of the transfer learning literature.

line 110: "Note that for"
-- This might have been the explanation of the transfer learning setting in which the authors only minimize the expected loss with respect to P^0. However, this sentence is incomplete.

Yes, that is exactly what that incomplete sentence was intended to be. We have completed it. Thank you for pointing this out!

Algorithm 1: The authors have given a detailed description on the computation of the covariance of mu-hat and w-hat in the response. It seems to me that this is not described in the text. I would suggest adding their explanation ("With the assumption that the source projection vectors are from a von Mises-Fisher distribution, the average of the source projection vectors is approximately normally distributed with covariance proportional to the identity matrix. When we say 'variance' of mu-hat in line 3 of Algorithm 1, we are referring to the covariance matrix of the average source vector via this normal approximation.") also in the text describing Algorithm 1.

We have added the sentence “We describe the exact procedure for calculating the optimal classifier in Algorithm 1, where Approx-Cov returns Ψ as described in Eq. (5)” to the end of Section 2.4. We believe this achieves the same as your suggestion, but please let us know if you think otherwise.

Reviewer 2 Report

Comments and Suggestions for Authors

Thanks for the improvement.

 

Comments on the Quality of English Language

It requires some English-language edits.

Author Response

Thank you for your time and thoughts!
