Article

Coordinate Descent for Variance-Component Models

Anant Mathur, Sarat Moka and Zdravko Botev
1 School of Mathematics and Statistics, University of New South Wales, Sydney, NSW 2052, Australia
2 School of Mathematical and Physical Sciences, Macquarie University, Sydney, NSW 2109, Australia
* Authors to whom correspondence should be addressed.
Algorithms 2022, 15(10), 354; https://doi.org/10.3390/a15100354
Submission received: 30 July 2022 / Revised: 23 September 2022 / Accepted: 23 September 2022 / Published: 28 September 2022
(This article belongs to the Special Issue Algorithms in Monte Carlo Methods)

Abstract

Variance-component models are an indispensable tool for statisticians wanting to capture both random and fixed model effects. They have applications in a wide range of scientific disciplines. While maximum likelihood estimation (MLE) is the most popular method for estimating the variance-component model parameters, it is numerically challenging for large data sets. In this article, we consider the class of coordinate descent (CD) algorithms for computing the MLE. We show that a basic implementation of coordinate descent is numerically costly to implement and does not easily satisfy the standard theoretical conditions for convergence. We instead propose two parameter-expanded versions of CD, called PX-CD and PXI-CD. These novel algorithms not only converge faster than existing competitors (MM and EM algorithms) but are also more amenable to convergence analysis. PX-CD and PXI-CD are particularly well-suited for large data sets—namely, as the scale of the model increases, the performance gap between the parameter-expanded CD algorithms and the current competitor methods increases.

1. Introduction

Linear models that contain both fixed and random effects are referred to as variance-component models or linear mixed models (LMMs). They arise in numerous applications, such as genetics [1], biology, economics, epidemiology and medicine. A broad coverage of existing methodologies and applications of these models can be found in the textbooks [2,3].
In the simplest variance-component setup, we observe a response vector $y \in \mathbb{R}^n$ and a predictor matrix $X \in \mathbb{R}^{n \times p}$, and assume that $y$ is an outcome of a normal random variable $Y \sim N(X\beta, \Omega)$, where the covariance is of the form
$$\Omega = \sum_{i=0}^{m} \gamma_i V_i \in \mathbb{R}^{n \times n}, \qquad \gamma_i \ge 0.$$
The matrices $V_0, \dots, V_m$ are fixed positive semi-definite matrices, and $V_0$ is non-singular. The unknown mean effects $\beta = (\beta_1, \dots, \beta_p)^\top$ and variance-component parameters $\gamma = (\gamma_0, \dots, \gamma_m)^\top$ can be estimated by maximizing the log-likelihood function
$$L(\beta, \gamma) = -\tfrac{1}{2} \ln\det\Omega - \tfrac{1}{2}\, (y - X\beta)^\top \Omega^{-1} (y - X\beta).$$
If $\Omega$ is known, the maximum likelihood estimator (MLE) for $\beta$ is given by
$$\hat{\beta} = \left(X^\top \Omega^{-1} X\right)^{-1} X^\top \Omega^{-1} y.$$
To simplify the MLE for $\gamma$, one can adopt the restricted MLE (REML) method [4] to remove the mean effect from the likelihood by projecting $y$ onto the null space of $X^\top$. Let $\nu = n - p$ and suppose we have the QR decomposition
$$X = \begin{bmatrix} Q_{[p]} & Q_{[\nu]} \end{bmatrix} \begin{bmatrix} R \\ 0_{\nu \times p} \end{bmatrix},$$
where $R$ is a $p \times p$ upper-triangular matrix, $0_{\nu \times p}$ is a $\nu \times p$ zero matrix, $Q_{[p]}$ is an $n \times p$ matrix, $Q_{[\nu]}$ is $n \times \nu$, and $Q_{[p]}$ and $Q_{[\nu]}$ both have orthogonal columns. If we take the Cholesky decomposition $L L^\top$ of the matrix $Q_{[\nu]}^\top V_0 Q_{[\nu]}$, then the transformation $L^{-1} Q_{[\nu]}^\top : \mathbb{R}^n \to \mathbb{R}^\nu$ removes the mean from the response, and we obtain $y' := L^{-1} Q_{[\nu]}^\top y \sim N(0, \Omega')$, where $\Omega' := \gamma_0 I + \sum_{i=1}^m \gamma_i V_i' \in \mathbb{R}^{\nu \times \nu}$ and $V_i' := L^{-1} Q_{[\nu]}^\top V_i Q_{[\nu]} L^{-\top}$. After this transformation, the restricted likelihood $-\tfrac{1}{2}\ln\det\Omega' - \tfrac{1}{2}(y')^\top (\Omega')^{-1} y'$ does not depend on $\beta$.
Henceforth, and without loss of generality, we assume that such a transformation has been performed, so that we can focus on minimizing an objective function of the form:
$$-2 L(\gamma) = \ln\det\Omega + y^\top \Omega^{-1} y. \qquad (2)$$
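For concreteness, the REML reduction and the objective (2) can be sketched in a few lines of NumPy. This is our own illustration with assumed helper names (reml_transform, neg2_restricted_loglik), not the implementation released by the authors:

```python
import numpy as np

def reml_transform(y, X, Vs):
    """Sketch of the REML projection described above. Vs = [V_0, V_1, ..., V_m];
    returns the nu-dimensional response and the transformed variance matrices,
    with V_0 mapped to the identity."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X, mode="complete")          # Q = [Q_[p]  Q_[nu]]
    Q_nu = Q[:, p:]                                   # n x nu, orthogonal to col(X)
    L = np.linalg.cholesky(Q_nu.T @ Vs[0] @ Q_nu)     # Q_[nu]^T V_0 Q_[nu] = L L^T
    T = np.linalg.solve(L, Q_nu.T)                    # the map L^{-1} Q_[nu]^T
    y_new = T @ y
    Vs_new = [np.eye(n - p)] + [T @ V @ T.T for V in Vs[1:]]
    return y_new, Vs_new

def neg2_restricted_loglik(gamma, y, Vs):
    """Objective (2): -2 L(gamma) = ln det(Omega) + y^T Omega^{-1} y."""
    Omega = sum(g * V for g, V in zip(gamma, Vs))
    _, logdet = np.linalg.slogdet(Omega)
    return logdet + y @ np.linalg.solve(Omega, y)
```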
There exists an extensive literature on optimization methods for the log-likelihood (2), including Newton's method [5], the Fisher scoring method [6], and the EM and MM algorithms [7,8,9]. Newton's method is known to scale poorly as $n$ or the number of variance components $m+1$ increases, due to the $O(m n^3) + O(m^3)$ flops required to invert a Hessian matrix at each update. Both the EM and MM algorithms have simple updating steps; however, numerical experience shows that they are slow to identify the active set $\{i : \hat\gamma_i = 0\}$, where $\hat\gamma$ is the MLE.
One class of algorithms yet to be applied to this problem is that of coordinate-descent (CD) algorithms. These algorithms successively minimize the objective function along coordinate directions and can be effective when each sub-problem is sufficiently simple to optimize. Furthermore, only a few assumptions are needed to prove that accumulation points of the iterative sequence are stationary points of the objective function. CD algorithms have been used to solve optimization problems for many years, and their popularity has grown considerably in the past few decades because of their usefulness in data science and machine learning tasks. Further in-depth discussions of CD algorithms can be found in [10,11,12].
In this paper, we show that a basic implementation of CD is costly for large-scale problems and does not easily succumb to standard convergence analysis. In contrast, our novel coordinate-descent algorithm called parameter-expanded coordinate descent (PX-CD) is computationally faster and more amenable to theoretical guarantees.
PX-CD is computationally cheaper to run than the basic CD implementation because the first and second derivatives of each sub-problem can be evaluated efficiently with the conjugate gradient (CG) algorithm [13], whereas the basic CD implementation requires repeated Cholesky factorizations for each coordinate update, each with a complexity of $O(n^3)$. Further to this, it is often the case that the $V_i$ are low-rank, and we can take advantage of this by employing the well-known Woodbury matrix identity or a QR transformation within PX-CD to reduce the computational cost of each univariate minimization.
In PX-CD, the extended parameters are treated as a block of coordinates, which is updated at each iteration by searching through a coordinate hyper-plane rather than single-coordinate directions. We provide an alternative version of PX-CD, which we call parameter expanded-immediate coordinate descent (PXI-CD), where the extended coordinate block is updated multiple times within each cycle of the original parameters. We observe numerically that, for large-scale models, the reduction in the number of iterations needed to converge greatly offsets the additional computational cost of each coordinate cycle. As a result, the overall convergence time is better than that of PX-CD.
From a theoretical point of view, we show that the accumulation points of the iterative sequence generated by both the PX-CD and PXI-CD are coordinate-wise minimum points of (2).
We remark that the improved efficiency of the PX-CD algorithm is similar to the well-known superior performance of the PX-DA (parameter-expanded data-augmentation) algorithm [14,15] in the Markov-chain Monte Carlo (MCMC) context—namely, the PX-DA algorithm is often much faster to converge than a basic data-augmentation Gibbs algorithm. This similarity is also the reason for using the same prefix “PX” in our nomenclature.
The remainder of the paper is structured as follows. In Section 2, we describe the basic implementation of CD and provide examples for which it performs unsatisfactorily. In Section 3, we introduce PX-CD and PXI-CD and show that accumulation points of the iterations are coordinate-wise minima for the optimization. We then discuss their practical implementation and detail how to reduce the computational cost when the $V_i$ are low-rank. We also extend the PX-CD algorithm to penalized estimation for variable selection. Then, in Section 4, we provide numerical results when the $V_i$ are computer simulated and when the $V_i$ are constructed from a real-world genetic data set. We have made our code for these simulations available on GitHub (https://github.com/anantmathur44/CD-for-variance-components) (accessed on 1 July 2022).

2. Basic Coordinate Descent

Recall that, after the REML procedure to remove the mean effect, we have $Y \sim N(0, \Omega)$ and $\Omega = \sum_{i=0}^m \gamma_i V_i$, where $V_0 = I_n$. We thus seek to compute:
$$\hat\gamma = \operatorname*{arg\,min}_{\gamma}\; G(\Omega), \qquad \gamma_i \ge 0, \quad i = 0, 1, \dots, m, \qquad (3)$$
$$G(\Omega) := y^\top \Omega^{-1} y + \ln\det\Omega. \qquad (4)$$
CD can be applied to solve this problem by successively minimizing $G$ along coordinate directions. There is significant scope for variation in the way components are selected for updating. In the most conventional version, the algorithm cycles through the parameters in the order $\gamma_0 \to \gamma_1 \to \cdots \to \gamma_m$ and updates each in turn. This version, known as cyclic coordinate descent, is shown in Algorithm 1.
Algorithm 1. Cyclic CD for $G(\Omega)$.
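The published listing for Algorithm 1 is rendered as an image and is not reproduced here. The following is a minimal Python sketch of the cyclic CD loop described in the text; the function names, the fixed number of cycles and the search bound `upper` are our own assumptions, and a generic bounded 1-D search stands in for the line search discussed below:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cyclic_cd(y, Vs, gamma0, n_cycles=100, upper=1e3):
    """Cyclic CD for G(Omega): sweep over gamma_0, ..., gamma_m and update each
    coordinate by a bounded one-dimensional search over [0, upper]."""
    gamma = np.asarray(gamma0, dtype=float).copy()

    def G(g):
        Omega = sum(gi * V for gi, V in zip(g, Vs))
        _, logdet = np.linalg.slogdet(Omega)
        return y @ np.linalg.solve(Omega, y) + logdet

    for _ in range(n_cycles):
        for k in range(len(gamma)):
            def g_k(x, k=k):
                trial = gamma.copy()
                trial[k] = x
                return G(trial)
            res = minimize_scalar(g_k, bounds=(0.0, upper), method="bounded")
            gamma[k] = res.x
    return gamma
```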
If we choose to minimize along one coordinate at a time, then the update of a parameter consists of a line search along the selected coordinate direction; that is, if the selected parameter is component $k$, then the new parameter value is updated as
$$\gamma_k^{(t+1)} \leftarrow \operatorname*{arg\,min}_{x \ge 0}\; G\!\left( \sum_{i=0}^{k-1} \gamma_i^{(t+1)} V_i + \sum_{i=k+1}^{m} \gamma_i^{(t)} V_i + x V_k \right).$$

2.1. Implementation

The non-trivial component of Algorithm 1 is the line search
$$\operatorname*{arg\,min}_{x \ge 0}\; G(\Omega_{-k} + x V_k),$$
where $\Omega_{-k} := \Omega - \gamma_k V_k$. When $\Omega_{-k}$ is invertible, this step can be simplified to a one-dimensional algebraic expression that can be solved numerically without repeated evaluations of the terms $y^\top (\Omega_{-k} + x V_k)^{-1} y$ and $\ln\det(\Omega_{-k} + x V_k)$. The simplification is achieved using the generalized eigenvalue decomposition (GEV) [13] to decompose $(V_k, \Omega_{-k}) \mapsto (D_k, U_k)$ such that
$$V_k U_k = \Omega_{-k} U_k D_k \quad \text{and} \quad U_k^\top \Omega_{-k} U_k = I_n,$$
where $D_k \in \mathbb{R}^{n \times n}$ is a diagonal matrix with non-negative entries and $U_k \in \mathbb{R}^{n \times n}$ is invertible. Using the above expressions, we obtain the factorization $\Omega_{-k} + x V_k = U_k^{-\top} (I_n + x D_k)\, U_k^{-1}$. Therefore, we can express the inverse and log-determinant terms in $G(\Omega_{-k} + x V_k)$ as
$$(\Omega_{-k} + x V_k)^{-1} = U_k (I_n + x D_k)^{-1} U_k^\top,$$
$$\ln\det(\Omega_{-k} + x V_k) = \ln\det(I_n + x D_k) - 2 \ln\det U_k.$$
From (4), we have that
$$G(\Omega_{-k} + x V_k) = y^\top U_k (I_n + x D_k)^{-1} U_k^\top y + \ln\det(I_n + x D_k) + \text{const}.$$
Let $\alpha_k = U_k^\top y$ and $d_k = \operatorname{diag}(D_k)$. Then the function to be minimized at the $k$-th component is of the form
$$g_k(x) := G(\Omega_{-k} + x V_k) = \sum_{j=1}^n \left[ \frac{\alpha_{k,j}^2}{d_{k,j}\, x + 1} + \ln(1 + d_{k,j}\, x) \right] + \text{const}, \qquad (7)$$
where $d_{k,j} \ge 0$ for all $j$. Unless $n = 1$, there is in general no closed-form expression for the minimizer, so we resort to a numerical method such as Newton's method or golden-section search. With the above simplification, the majority of the cost is attributed to the GEV, which has a time complexity of $O(14 n^3)$. Alternatively, one could employ iterative methods without first simplifying $g_k(x)$ via the GEV; in that case, however, each evaluation of $g_k(x)$ and its derivatives requires a full Cholesky factorization, costing $O(n^3)$. Either way, for problems where $n$ is large, basic CD is too costly per coordinate update.
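As a hedged illustration of this simplification (not the authors' code), the GEV pair $(d_k, U_k)$ above can be obtained with scipy.linalg.eigh, which solves the generalized eigenproblem with $B$-orthonormal eigenvectors, after which $g_k$ in (7) becomes a cheap scalar function; note that the single bounded search below returns a local minimizer only, which is exactly the difficulty discussed in Section 2.2:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.optimize import minimize_scalar

def line_search_gev(y, V_k, Omega_mk, upper=1e3):
    """Minimize g_k(x) = G(Omega_{-k} + x V_k) over x >= 0 via the GEV form (7).
    Requires Omega_{-k} to be positive definite; 'upper' is an assumed bound."""
    d_k, U_k = eigh(V_k, Omega_mk)      # V_k U = Omega_{-k} U diag(d_k), U^T Omega_{-k} U = I
    alpha = U_k.T @ y

    def g_k(x):
        return np.sum(alpha**2 / (d_k * x + 1.0) + np.log1p(d_k * x))

    # A bounded search only returns one local minimizer of a possibly multimodal g_k.
    res = minimize_scalar(g_k, bounds=(0.0, upper), method="bounded")
    return res.x
```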

2.2. Convergence

An interesting question is whether the sequence generated by Algorithm 1 converges to a local minimum of the objective function (3) (which is assumed to have a global minimum). To make a general point about the convergence theory, let $\mathcal{C}$ be the set of limit points of a coordinate-descent sequence $\{\gamma^{(t)}\}$. It is well known [16,17] that the following existence and uniqueness assumption:
$$g_k(x) = G(\Omega_{-k} + x V_k) \ \text{has a unique (global) minimizer for } x \ge 0 \qquad (8)$$
is one of the simplest sufficient conditions that ensures the set $\mathcal{C}$ is not empty and contains only singletons. Assuming that the existence and uniqueness assumption holds for a given coordinate-descent algorithm and $\gamma^* \in \mathcal{C}$, then, by construction, $\gamma^*$ is a global coordinate-wise minimum of (3); that is, one cannot reduce the value of the objective function by moving along any of the coordinate directions (that is, $G(\Omega_{-k} + x V_k) \ge G(\Omega_{-k} + \gamma_k^* V_k)$ for each $k$ and all $x \ge 0$).
Even if $\gamma^* \in \mathcal{C}$ is a coordinate-wise minimum, it may not be a local minimum of (3); it may well be a saddle point of (3). For example, the function $f(\gamma_1, \gamma_2) = \gamma_1^2 + \gamma_2^2 - 5\gamma_1\gamma_2$ does not have a local minimum at zero ($f(\gamma, \gamma) = -3\gamma^2$, indicating that a global minimum does not even exist); however, it has a coordinate-wise minimum at $(\gamma_1, \gamma_2) = (0, 0)$.
Thus, it is important to remember that the set of local minima of the optimization problem (3) is a subset of C because C may contain coordinate-wise minima, which are still saddle points of (3). One positive aspect is that the saddle points found by any coordinate-descent algorithm (under the existence and uniqueness assumption) are a subset of all the saddle points of (3) because they are constrained to be coordinate-wise minima. Stated another way, the set C consists of either local minimizers or saddle points that look like coordinate-wise minimizers.
Either way, there is simply no guarantee that any of our coordinate-descent procedures in this paper will converge to a strict local minimum of (3); see [16,18] for more in-depth discussions. Of course, this is an issue for any of the existing optimization algorithms for (3) (the MM and EM algorithms simply being special cases of coordinate descent, and Newton's method being known to converge only when initialized near a local minimum), and thus it should not be viewed as a particular disadvantage of our proposals.
Unfortunately, the existence and uniqueness assumption (8) cannot be used to deal with the convergence of the basic CD Algorithm 1. This is because $g_k(x)$ can exhibit multiple local minima, as illustrated next.
Suppose $n = 2$; then, from Equation (7), we have
$$g_k(x) = \frac{\alpha_{k,1}^2}{d_{k,1}\, x + 1} + \ln(1 + d_{k,1}\, x) + \frac{\alpha_{k,2}^2}{d_{k,2}\, x + 1} + \ln(1 + d_{k,2}\, x) + \text{const}.$$
In Figure 1, we observe two minimizers of $g_k(x)$ when $\alpha_k = (1.2, 3)$ and $d_k = (10, 0.2)$. This implies that the existence and uniqueness assumption does not hold for the basic implementation of CD, and we cannot ensure that accumulation points of the sequence $\{\gamma^{(t)}\}$ are coordinate-wise minima of $G$.
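This bimodality is straightforward to verify numerically; the short snippet below (the grid resolution is our own choice) recovers the two local minima reported in Figure 1:

```python
import numpy as np

# Locate the interior local minima of g_k for alpha_k = (1.2, 3), d_k = (10, 0.2),
# ignoring the additive constant in (7).
alpha = np.array([1.2, 3.0])
d = np.array([10.0, 0.2])
x = np.linspace(0.0, 30.0, 300001)
g = (alpha[:, None] ** 2 / (d[:, None] * x + 1.0) + np.log1p(d[:, None] * x)).sum(axis=0)
idx = np.where((g[1:-1] < g[:-2]) & (g[1:-1] < g[2:]))[0] + 1
print(list(zip(x[idx].round(2), g[idx].round(2))))   # approx. (0.11, 10.26) and (14.35, 8.66)
```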
Example 1 (Sufficient Conditions for a Unique Minimum).
In this example, we show that strong conditions are needed for the existence and uniqueness assumption to hold, making the basic CD Algorithm 1 less than attractive. Suppose $\delta := \tfrac{1}{n}\sum_{j=1}^n \alpha_{k,j}^2 > 1$ and there is a constant $d > 0$ such that $d_{k,j} = d$ for all $j = 1, \dots, n$. In that case, we have
$$g_k(x) = \frac{n\delta}{d x + 1} + n \ln(1 + d x) + \text{const}.$$
Therefore, the first and second derivatives of $g_k(x)$ are, respectively, given by
$$g_k'(x) = \frac{n d}{1 + d x}\left[1 - \frac{\delta}{1 + d x}\right] \quad \text{and} \quad g_k''(x) = \frac{n d^2}{(1 + d x)^2}\left[\frac{2\delta}{1 + d x} - 1\right].$$
Since $d > 0$, it is easy to see that the equation $g_k'(x) = 0$ has a unique (positive) solution, $x_1^* = \frac{\delta - 1}{d}$. Similarly, the solution of $g_k''(x) = 0$ is $x_2^* = \frac{2\delta - 1}{d}$, which is greater than $x_1^*$.
Since $g_k''(x) > 0$ for every $x \in [0, x_2^*)$, $g_k(x)$ is (strictly) convex over $[0, x_2^*)$. As a result, since $0 < x_1^* < x_2^*$, $g_k(x)$ has a global minimum at $x_1^*$ (indeed, $g_k'(x) > 0$ for all $x > x_1^*$, so the minimizer is global on $[0, \infty)$).

3. Parameter-Expanded CD

Since the basic CD Algorithm 1 is both expensive per coordinate update and is not amenable to standard convergence analysis [16,17], we consider an alternative called the parameter-expanded CD or PX-CD. We argue that our novel coordinate-descent algorithm is both faster to converge and also amenable to simple convergence analysis because the existence and uniqueness assumption holds. This constitutes our main contribution.
In PX-CD, we use the supporting hyper-plane (first-order Taylor approximation) of the concave matrix function $f(A) = \ln\det A$, where $A \succ 0$. The supporting hyper-plane gives the bound [9]:
$$\ln\det A \le \ln\det C + \operatorname{tr}\!\left[C^{-1}(A - C)\right], \qquad (9)$$
where $C \in \mathbb{R}^{n \times n}$ is an arbitrary positive-definite matrix, and equality is achieved if and only if $C = A$. Replacing the log-determinant term in $G$ with this upper bound, we obtain the surrogate function
$$H(\Omega, C) := y^\top \Omega^{-1} y + \sum_{i=0}^m \gamma_i \operatorname{tr}(C^{-1} V_i) + \ln\det(C) - n \ \ge\ G(\Omega).$$
The surrogate function $H$ has $C$ as an extra variable in our optimization, which we set to be of the form:
$$C = \sum_{i=0}^m \tilde\gamma_i V_i,$$
where $\tilde\gamma = (\tilde\gamma_0, \tilde\gamma_1, \dots, \tilde\gamma_m)$ are latent parameters. Similar to the MM algorithmic recipe [9], we then jointly minimize the surrogate function $H$ with respect to both $\gamma$ and $\tilde\gamma$ using CD.
The most apparent way of selecting our coordinates is to cyclically update in the order:
$$\gamma_0 \to \gamma_1 \to \cdots \to \gamma_m \to \tilde\gamma,$$
where the last update is a block update of the entire block $\tilde\gamma$. In other words, the expanded parameters $\tilde\gamma$ are treated as a block of coordinates that is updated in each cycle by searching through a coordinate hyper-plane rather than single-coordinate directions. We refer to a full completion of updates in a single ordering as a "cycle" of updates. Suppose the initial guess for the parameters is $(\gamma^{(0)}, \tilde\gamma^{(0)})$; then, at the end of cycle $t$, we denote the updated parameters by $(\gamma^{(t)}, \tilde\gamma^{(t)})$. In Theorem 1, we state that, under certain conditions, the sequence $\{(\gamma^{(t)}, \tilde\gamma^{(t)})\}_{t \ge 0}$ generated by PX-CD has limit points that are coordinate-wise minima of $G$.
Let $\Omega^{(t)} = \sum_{i=0}^m \gamma_i^{(t)} V_i$ be the updated covariance matrix after the $m+1$ original parameters have been updated in cycle $t$. Then, since the inequality in (9) achieves equality if and only if $C = \Omega$, the update of the expanded parameter block $\tilde\gamma$ in cycle $t$ is
$$\tilde\gamma^{(t)} = \operatorname*{arg\,min}_{\tilde\gamma}\; H\!\left(\Omega^{(t)}, \sum_{i=0}^m \tilde\gamma_i V_i\right) = \gamma^{(t)}.$$
In practice, we simply store $C^{(t)} = \Omega^{(t)}$ at the end of each cycle.
Minimizing $H$ with respect to the $k$-th component of the original parameter $\gamma$ yields a function of the form:
$$h_k(x) := H(\Omega_{-k} + x V_k, C) = y^\top (\Omega_{-k} + x V_k)^{-1} y + x \operatorname{tr}\!\left(C^{-1} V_k\right) + \text{const}, \qquad x \ge 0.$$
One of the main advantages of the PX-CD procedure over the basic coordinate descent in Algorithm 1 is that the optimization along each coordinate has a unique minimum.
Lemma 1.
$h_k(x)$ has a unique minimizer for $x \ge 0$.
Proof. 
We now show that, on $[0, \infty)$, the function $h_k(x)$ is either strictly convex or a linear function with a strictly positive gradient.
We first consider the case where $\Omega_{-k}$ is invertible. From [13], we have the GEV decomposition $(V_k, \Omega_{-k}) \mapsto (D_k, U_k)$, where $D_k \in \mathbb{R}^{n \times n}$ is a diagonal matrix with non-negative entries and $U_k \in \mathbb{R}^{n \times n}$ is invertible. In a similar fashion to the simplified basic CD expression (7), let $\alpha_k = U_k^\top y$ and $d_k = \operatorname{diag}(D_k)$. Then $h_k(x)$ can be simplified to
$$h_k(x) = \sum_{j=1}^n \frac{\alpha_{k,j}^2}{d_{k,j}\, x + 1} + x \operatorname{tr}\!\left(C^{-1} V_k\right) + \text{const}. \qquad (11)$$
We then obtain the first and second derivatives,
$$h_k'(x) = -\sum_{j=1}^n \frac{\alpha_{k,j}^2\, d_{k,j}}{(d_{k,j}\, x + 1)^2} + \operatorname{tr}\!\left(C^{-1} V_k\right), \qquad h_k''(x) = \sum_{j=1}^n \frac{2\, \alpha_{k,j}^2\, d_{k,j}^2}{(d_{k,j}\, x + 1)^3},$$
where $d_{k,j} \ge 0$. If there exists $j$ such that $\alpha_{k,j}\, d_{k,j} \ne 0$, then $h_k''(x) > 0$ for $x \ge 0$; hence $h_k$ is strictly convex and attains a unique global minimizer $x^* \in [0, \infty)$. Suppose instead that $\alpha_{k,j}\, d_{k,j} = 0$ for all $j = 1, \dots, n$; then $h_k'(x) = \operatorname{tr}(C^{-1} V_k)$. If we can show that $\operatorname{tr}(C^{-1} V_k) > 0$, then $h_k(x)$ is strictly increasing on $[0, \infty)$, and $x^* = 0$ is the unique global minimizer for $x \ge 0$.
We note that the matrix $C^{-1}$ is positive-definite, since $C = \sum_{i=0}^m \tilde\gamma_i V_i$ is positive semi-definite and invertible. Therefore, the symmetric square-root factorization $C^{-1} = C^{-1/2} C^{-1/2}$ exists, and
$$\operatorname{tr}\!\left(C^{-1} V_k\right) = \operatorname{tr}\!\left(C^{-1/2} V_k C^{-1/2}\right),$$
by the cyclic invariance of the trace. The matrix $C^{-1/2} V_k C^{-1/2}$ is positive semi-definite, since $z^\top C^{-1/2} V_k C^{-1/2} z = \| V_k^{1/2} C^{-1/2} z \|_2^2 \ge 0$ for all $z \in \mathbb{R}^n$. Since $C^{-1/2} V_k C^{-1/2}$ is a non-zero positive semi-definite matrix, $\operatorname{tr}(C^{-1} V_k) > 0$ and $h_k(x)$ has a strictly positive slope, which implies that $x^* = 0$ is the unique global minimizer for $x \ge 0$.
Now consider the case $\dim(\operatorname{span}\{V_1, \dots, V_m\}) = r < n$. Assuming $\gamma_0 > 0$, $\Omega_{-k}$ is invertible except when $k = 0$. When $\Omega_{-0}$ is singular, a simplified expression of the form (11) may be difficult to find. Instead, we take the singular value decomposition (SVD) of the symmetric matrix $\Omega_{-0}$,
$$\Omega_{-0} = Q \Lambda Q^\top, \qquad \Lambda = \begin{bmatrix} \operatorname{diag}(\lambda_1, \dots, \lambda_r) & 0 \\ 0 & 0 \end{bmatrix},$$
where $\lambda_1, \dots, \lambda_r > 0$ are the positive eigenvalues of $\Omega_{-0}$ and the matrix $Q \in \mathbb{R}^{n \times n}$ is orthogonal. Then we can express the inverse as
$$\left(\Omega_{-0} + x I\right)^{-1} = Q (\Lambda + x I_n)^{-1} Q^\top.$$
If we assume $y \notin \operatorname{span}\{V_1, \dots, V_m\}$, then the last $n - r$ entries of $\alpha = Q^\top y$ are not all zero. Then,
$$h_0(x) = \sum_{j=1}^r \frac{\alpha_j^2}{\lambda_j + x} + \sum_{j=r+1}^n \frac{\alpha_j^2}{x} + x \operatorname{tr}\!\left(C^{-1} V_0\right) + \text{const},$$
and
$$h_0''(x) = \sum_{j=1}^r \frac{2 \alpha_j^2}{(\lambda_j + x)^3} + \sum_{j=r+1}^n \frac{2 \alpha_j^2}{x^3} > 0,$$
when $x > 0$. Therefore, $h_0(x)$ still attains a unique minimizer, as the function is strictly convex for $x > 0$.   □
The result of this lemma establishes the existence and uniqueness condition (8), so accumulation (limit) points of the CD iterations are coordinate-wise minimum points. The details of the optimization follow.

3.1. Univariate Minimization via Newton’s Method

Unlike the basic CD Algorithm 1, for which each coordinate update costs $O(n^3)$, here we show that a coordinate update for the PX-CD algorithm costs only $O(j n^2)$ for some constant $j$, where typically $j \ll n$.
The function $h_k(x)$ can be minimized via the second-order Newton's method, which numerically finds a root of $h_k'(x)$. The basic algorithm starts with an initial guess $x_0$ of the root, and then
$$x_{n+1} = x_n - h_k'(x_n)\left[ h_k''(x_n) \right]^{-1} \qquad (14)$$
are successively better approximations. The algorithm terminates once successive iterates are sufficiently close together. The first and second derivatives of $h_k$ are given by
$$h_k'(x) = -y^\top (\Omega_{-k} + x V_k)^{-1} V_k (\Omega_{-k} + x V_k)^{-1} y + \operatorname{tr}\!\left(C^{-1} V_k\right), \qquad (15)$$
$$h_k''(x) = 2\, y^\top (\Omega_{-k} + x V_k)^{-1} V_k (\Omega_{-k} + x V_k)^{-1} V_k (\Omega_{-k} + x V_k)^{-1} y, \qquad (16)$$
where we used the derivative of a matrix inverse, which implies that
$$\frac{\partial (\Omega_{-k} + x V_k)^{-1}}{\partial x} = -(\Omega_{-k} + x V_k)^{-1}\, \frac{\partial (\Omega_{-k} + x V_k)}{\partial x}\, (\Omega_{-k} + x V_k)^{-1}.$$
Similar to the basic CD implementation, computing an algebraic expression such as (11) via the GEV is expensive. Evaluating (15) and (16) by explicitly calculating $(\Omega_{-k} + x V_k)^{-1}$ is also expensive for large $n$, with time complexity $O(n^3)$. Instead, we utilize the conjugate gradient (CG) algorithm [13] to solve linear systems efficiently. At each iteration of Newton's method, we approximately solve
$$(\Omega_{-k} + x V_k)\, b = y, \qquad (\Omega_{-k} + x V_k)\, c = V_k b,$$
and store the solutions in $b$ and $c$, respectively, via the CG algorithm. Generally, the errors $\| b - (\Omega_{-k} + x V_k)^{-1} y \|$ and $\| c - (\Omega_{-k} + x V_k)^{-1} V_k b \|$ can be made small with $l \ll n$ iterations, where each iteration requires a matrix-vector multiplication with an $n \times n$ matrix. The CG algorithm thus has complexity $O(l n^2)$ and can be easily implemented with standard linear-algebra packages. With the stored approximate solutions, we evaluate the first and second derivatives as
$$h_k'(x) = -b^\top V_k b + \operatorname{tr}\!\left(C^{-1} V_k\right), \qquad h_k''(x) = 2\, b^\top V_k c.$$
Before initiating Newton's method, we can check whether $k$ is in the active constraint set $\{k : \hat\gamma_k = 0\}$. By Lemma 1, if $h_k'(0) \ge 0$, then $h_k(x)$ is non-decreasing on $[0, \infty)$; hence $h_k(0)$ is the global minimum for $x \ge 0$, and we set $\gamma_k^{(t+1)} = 0$ if we are in cycle $t+1$ of PX-CD. If $h_k'(0) < 0$, we initiate Newton's method at the current value of the variance component, $x_0 = \gamma_k^{(t)}$. If $\dim(\operatorname{span}\{V_1, \dots, V_m\}) < n$, we require $\gamma_0 > 0$ so that $\Omega$ is invertible.
In that case, $k = 0$ cannot be in the active constraint set, and we immediately initiate Newton's method at the starting point $x_0 = \gamma_0^{(t)}$. In rare cases, $h_k(x)$ is sufficiently flat at $x_n$ that (14) significantly oversteps the location of the minimizer and returns an approximation $x_{n+1} < 0$. In this case, we dampen the step size until $x_{n+1} > 0$.
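Putting this subsection together, one PX-CD coordinate update can be sketched as follows. This is our own Python illustration, not the authors' released implementation: the helper name pxcd_update, the CG defaults, the Newton iteration cap and the stopping tolerance are all assumptions, and $\operatorname{tr}(C^{-1}V_k)$ is assumed to be precomputed and passed in as trace_k:

```python
import numpy as np
from scipy.sparse.linalg import cg

def pxcd_update(y, Vs, gamma, trace_k, k, newton_steps=25, tol=1e-8):
    """One PX-CD update of gamma[k] via Newton's method on h_k, with the two
    n-dimensional linear systems of each step solved approximately by CG.
    Requires Omega_{-k} to be invertible (e.g. gamma[0] > 0 and k != 0)."""
    V_k = Vs[k]
    Omega_mk = sum(g * V for j, (g, V) in enumerate(zip(gamma, Vs)) if j != k)

    def derivatives(x):
        A = Omega_mk + x * V_k
        b, _ = cg(A, y)                      # b ~ (Omega_{-k} + x V_k)^{-1} y
        c, _ = cg(A, V_k @ b)                # c ~ (Omega_{-k} + x V_k)^{-1} V_k b
        h1 = -b @ (V_k @ b) + trace_k        # h_k'(x)
        h2 = 2.0 * b @ (V_k @ c)             # h_k''(x)
        return h1, h2

    h1, _ = derivatives(0.0)
    if h1 >= 0.0:                            # active constraint: x = 0 is optimal
        return 0.0
    x = gamma[k]                             # start Newton at the current value
    for _ in range(newton_steps):
        h1, h2 = derivatives(x)
        step = h1 / h2
        x_new = x - step
        while x_new < 0.0:                   # dampen the step if it oversteps zero
            step *= 0.5
            x_new = x - step
        if abs(x_new - x) <= tol * (1.0 + x):
            return x_new
        x = x_new
    return x
```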

3.2. Updating Regime

We now consider an alternative to the cyclic ordering of updates. Suppose we update the block $\tilde\gamma$ after every coordinate update; that is, the updating order of one complete cycle is
$$\gamma_0 \to \tilde\gamma \to \gamma_1 \to \tilde\gamma \to \cdots \to \gamma_m \to \tilde\gamma.$$
This ordering regime satisfies the “essentially cyclic” condition whereby, in every stretch of 2 ( m + 1 ) updates, each component is updated at least once. We refer to CD with this ordering as parameter expanded-immediate coordinate descent (PXI-CD).
In practice, this ordering means updating the matrix $C$ after every update made to each $\gamma_k$. Since the expression for $h_k(x)$ requires evaluating $\operatorname{tr}(C^{-1} V_k)$ and $C$ is updated after every coordinate, we must re-compute $C^{-1}$ for each $k$. Each cycle of PXI-CD is therefore more expensive than a cycle of PX-CD. However, we observe that, in situations where the $V_i$ are full-rank and $n$ is sufficiently large, the number of cycles needed to converge is significantly smaller than that required by PX-CD and basic CD.
This makes PXI-CD the most time-efficient algorithm when the scale of the problem is large. In Section 3.3, we show that, when the $V_i$ are low-rank, re-computing $\operatorname{tr}(C^{-1} V_k)$ comes at no additional cost through the use of the Woodbury matrix identity. However, in this particular scenario, where the $V_i$ are low-rank, the performance gain of PXI-CD is not as significant as when the $V_i$ are full-rank, and PXI-CD and PX-CD show similar performance. Algorithm 2 summarizes both the PX-CD and PXI-CD methods for obtaining $\hat\gamma$.
Algorithm 2. PX-CD and PXI-CD for $G(\Omega)$.
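As with Algorithm 1, the published listing for Algorithm 2 is an image and is not reproduced here. The sketch below only illustrates the two updating regimes; it reuses the hypothetical pxcd_update helper from the sketch in Section 3.1, and the fixed number of cycles stands in for a proper convergence test:

```python
import numpy as np

def parameter_expanded_cd(y, Vs, gamma0, immediate=False, n_cycles=100):
    """PX-CD (immediate=False) or PXI-CD (immediate=True): cyclic updates of
    gamma, with the expanded block C refreshed either once per cycle or after
    every coordinate update."""
    gamma = np.asarray(gamma0, dtype=float).copy()
    C_inv = np.linalg.inv(sum(g * V for g, V in zip(gamma, Vs)))
    for _ in range(n_cycles):
        for k in range(len(gamma)):
            trace_k = np.trace(C_inv @ Vs[k])          # tr(C^{-1} V_k)
            gamma[k] = pxcd_update(y, Vs, gamma, trace_k, k)
            if immediate:                              # PXI-CD: C <- Omega every coordinate
                C_inv = np.linalg.inv(sum(g * V for g, V in zip(gamma, Vs)))
        if not immediate:                              # PX-CD: C <- Omega once per cycle
            C_inv = np.linalg.inv(sum(g * V for g, V in zip(gamma, Vs)))
    return gamma
```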
As mentioned previously, the novel parameter-expanded coordinate-descent algorithms, PX-CD and PXI-CD, are both amenable to standard convergence analysis.
Theorem 1 (PX-CD and PXI-CD Limit Points).
For both PX-CD and PXI-CD in Algorithm 2, let $\{\gamma^{(t)}, \tilde\gamma^{(t)}\}_{t \ge 0}$ be the coordinate-descent sequence. Then, either $G(\sum_k \gamma_k^{(t)} V_k) \to -\infty$, or every limit point of $\{\gamma^{(t)}\}_{t \ge 0}$ is a coordinate-wise minimum of (3). If we further assume that $y \notin \operatorname{span}\{V_1, \dots, V_m\}$, then the sequence $\{\gamma^{(t)}\}_{t \ge 0}$ is bounded and the possibility $G(\sum_k \gamma_k^{(t)} V_k) \to -\infty$ is ruled out.
Proof. 
Recall that $\Omega = \sum_{i=0}^m \gamma_i V_i$ and $C = \sum_{i=0}^m \tilde\gamma_i V_i$. Denote $(x_1, \dots, x_{m+1}) := \gamma$ and $x_{m+2} := \tilde\gamma \in \mathbb{R}^{m+1}$, as well as $x := (x_1, \dots, x_{m+1}, x_{m+2}) \in \mathbb{R}^{2(m+1)}$. We can rewrite the optimization problem (3) in the penalized form:
$$f(x) := f_0(x) + \sum_{k=1}^{m+1} f_k(x_k) + f_{m+2}(x_{m+2}),$$
where $f_0(x) := H(\Omega, C)$ and
$$f_k(x) := \begin{cases} 0, & x \ge 0, \\ \infty, & x < 0, \end{cases} \quad k = 1, \dots, m+1, \qquad f_{m+2}(x) := \begin{cases} 0, & x \in [0, \infty)^{m+1}, \\ \infty, & x \notin [0, \infty)^{m+1}. \end{cases}$$
We can then apply the results in [18], which state that every limit point of $\{\gamma^{(t)}\}_{t \ge 0}$ is a coordinate-wise minimum provided that:
• Each function $x_k \mapsto f(x)$, $k = 1, \dots, m+1$, and $x_{m+2} \mapsto f(x)$ has a unique minimum. For $k = 1, \dots, m+1$, this has already been verified in Lemma 1. For $x_{m+2} \mapsto f(x)$, we simply recall that $H(\Omega, C) \ge G(\Omega)$, with equality achieved if and only if $C = \Omega$, or equivalently $x_{m+2} = (x_1, \dots, x_{m+1})$.
• Each $f_k$, $k = 1, \dots, m+2$, is lower semi-continuous. This is clearly true for $k = 1, \dots, m+1$ because, at the point of discontinuity, we have $\liminf_{x \to 0} f_k(x) \ge f_k(0) = 0$. For $f_{m+2}$, we simply check that $\{x : f_{m+2}(x) \le c\}$ is a closed set for any given $c \in \mathbb{R}$.
• The domain of $f_0$ is a Cartesian product and $f_0$ is continuous on its domain. Clearly, the domain is the $2(m+1)$-fold Cartesian product $D := [0, \infty) \times \cdots \times [0, \infty)$, and $f_0$ is continuous on its effective domain $\{x \in D : f(x) < \infty\}$.
• The updating rule is essentially cyclic; that is, there exists a constant $T \ge m+2$ such that every block in $(x_1, \dots, x_{m+1}, x_{m+2})$ is updated at least once between the $r$-th iteration and the $(r + T - 1)$-th iteration, for all $r$. In our case, each block is updated at least once in one iteration of Algorithm 2, so we satisfy the essentially cyclic condition. In PXI-CD, the block $x_{m+2}$ is actually updated $(m+1)$ times.
Thus, we can conclude by Proposition 5.1 in [18] that either $f_0(\gamma^{(t)}) \to -\infty$ or the limit points of $\{\gamma^{(t)}\}_{t \ge 0}$ are coordinate-wise minima of $H(\Omega, C)$.
If we further assume that $y \notin \operatorname{span}\{V_1, \dots, V_m\}$, then we can show that the set $\{\gamma \ge 0 : f_0(\gamma) \le f_0(\gamma^{(0)})\}$ is compact, which ensures that the sequence $\{\gamma^{(t)}\}_{t \ge 0}$ is bounded and rules out the possibility that $\lim_{t \to \infty} f_0(\gamma^{(t)}) = -\infty$. To see that this is the case, note that $G(\Omega)$ provides a lower bound for $H(\Omega, C)$, so it is sufficient to show that $\{\gamma \ge 0 : G(\sum_k \gamma_k V_k) \le c\}$ is compact for any $c \in \mathbb{R}$ under the assumptions that $V_0 := I$ and $y \notin \operatorname{span}\{V_1, \dots, V_m\}$. These are precisely the conditions of Lemma 3 in [9], which ensure that this sub-level set of the negative restricted log-likelihood $-2L(\gamma) := G(\sum_k \gamma_k V_k)$ is compact.
Finally, note that since we update the entire block $\tilde\gamma$ simultaneously (in both PX-CD and PXI-CD), a coordinate-wise minimum of $H(\Omega, C)$ is also a coordinate-wise minimum of $G(\Omega)$.   □
We again emphasize that, as with all the competitor methods, the theorem does not guarantee convergence of the coordinate-descent sequence $\{\gamma^{(t)}\}_{t \ge 0}$, or that convergence will be to a local minimum. The only thing we can say for sure is that, when the sequence converges, the limit will be a coordinate-wise minimum (which could be a saddle point in some special cases). Nevertheless, our numerical experience in Section 4 is that the sequence always converges to a coordinate-wise minimum and that this coordinate-wise minimum is in fact a (local) minimum.

3.3. Linear Mixed Model Implementation

We now show that, for linear mixed models (LMMs), we can reduce the computational complexity of each sub-problem to $O(j d^2)$, for some constants $j \ll n$ and $d < n$. As shown in Section 3.1, solving each univariate sub-problem reduces to running Newton's method, where each Newton update requires solving two $n$-dimensional linear systems.
In settings where $\operatorname{rank}(V_i) < n$ for $i = 1, \dots, m$, we are able to reduce the dimension of the linear systems that need to be solved. To see this, let us first specify the general variance-component model (also known as the general mixed ANOVA model) [19]. Suppose
$$y = X\beta + Z_1 b_1 + \cdots + Z_m b_m + \epsilon,$$
where $X$ is an $n \times p$ matrix of known fixed numbers, $p \le n$; $\beta$ is a $p \times 1$ vector of unknown constants; $Z_i$ is a full-rank $n \times c_i$ matrix of known fixed numbers, $c_i \le n$; $b_i$ is a $c_i \times 1$ vector of independent variables from $N(0, \gamma_i)$, which are unknown; and $\epsilon$ is an $n \times 1$ vector of independent errors from $N(0, \gamma_0)$, which are also unknown. In this setup, $\gamma_0, \dots, \gamma_m$ are the variance-component parameters to be estimated, $V_i = Z_i Z_i^\top$, and $\Omega = \gamma_0 I + Z \Sigma Z^\top$, where
$$Z = \begin{bmatrix} Z_1 & Z_2 & \cdots & Z_m \end{bmatrix}, \qquad \Sigma = \operatorname{blockdiag}(\gamma_1 I_{c_1}, \dots, \gamma_m I_{c_m}).$$
We now provide two methods that take advantage of the $V_i$ being low-rank. Let
$$c := \sum_{i=1}^m \operatorname{rank}(V_i) = \sum_{i=1}^m c_i.$$
In the first method, we utilize a QR factorization that reduces the computational complexity when $c < n$, i.e., when the column rank of $Z$ is less than $n$. In the second method, we use the Woodbury matrix identity to reduce the complexity when $c_i < n$ for $i = 1, \dots, m$, i.e., when the column rank of each matrix $Z_i$ is less than $n$.

3.3.1. QR Method

The following QR factorization can be viewed as a data pre-processing step that allows all PX-CD and PXI-CD computations to be $c$-dimensional instead of $n$-dimensional. The QR factorization only needs to be computed once, initially, at a cost of $O(c n^2)$ operations. Let the QR factorization of $Z \in \mathbb{R}^{n \times c}$ be
$$Z = \underbrace{\begin{bmatrix} Q_{[c]} & Q_{[n-c]} \end{bmatrix}}_{Q} \begin{bmatrix} R \\ 0 \end{bmatrix}, \qquad R = \begin{bmatrix} R_1 & R_2 & \cdots & R_m \end{bmatrix},$$
where $R$ is a $c \times c$ upper-triangular matrix, $0$ is an $(n-c) \times c$ zero matrix, $Q_{[c]}$ is an $n \times c$ matrix, $Q_{[n-c]}$ is $n \times (n-c)$, and $Q_{[c]}$ and $Q_{[n-c]}$ both have orthogonal columns. The matrix $R$ is partitioned such that the number of columns in $R_i$ equals the number of columns in $Z_i$. Let $\tilde{y} = [\tilde{y}_{[c]}, \tilde{y}_{[n-c]}] = Q^\top y$, where $\tilde{y}_{[c]}$ contains the first $c$ elements of $\tilde{y}$ and $\tilde{y}_{[n-c]}$ the last $n - c$ elements. Then,
$$H(\Omega, C) = \tilde{y}_{[c]}^\top \tilde{\Omega}^{-1} \tilde{y}_{[c]} + \sum_{i=0}^m \gamma_i \operatorname{tr}\!\left(\tilde{C}^{-1} \tilde{V}_i\right) + \ln\det(C) - n + \alpha,$$
where we define $\tilde{V}_i := R_i R_i^\top$, $\tilde{V}_0 := I_c$, $\tilde{\Omega} := \sum_{i=0}^m \gamma_i \tilde{V}_i$, $\tilde{C} := \sum_{i=0}^m \tilde{\gamma}_i \tilde{V}_i$, and $\alpha := \gamma_0 \tilde{\gamma}_0^{-1} (n-c) + \gamma_0^{-1}\, \tilde{y}_{[n-c]}^\top \tilde{y}_{[n-c]}$. The details of this derivation are provided in Appendix A. To implement PX-CD or PXI-CD after this transformation, we run Algorithm 2 with inputs $\tilde{V}_0, \dots, \tilde{V}_m \in \mathbb{R}^{c \times c}$ and $\tilde{y}_{[c]} \in \mathbb{R}^{c \times 1}$. In this simplification of $H$, we have the additional term $\alpha$, which depends on $\gamma_0$. Therefore, when we update the parameter $\gamma_0$ with Newton's method, we must also add
$$\frac{\partial \alpha}{\partial \gamma_0} = \tilde{\gamma}_0^{-1} (n - c) - \frac{1}{\gamma_0^{2}}\, \tilde{y}_{[n-c]}^\top \tilde{y}_{[n-c]} \qquad \text{and} \qquad \frac{\partial^2 \alpha}{\partial \gamma_0^2} = \frac{2}{\gamma_0^{3}}\, \tilde{y}_{[n-c]}^\top \tilde{y}_{[n-c]}$$
to the corresponding derivatives used in Newton's method in Section 3.1.
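A minimal NumPy sketch of this pre-processing step, assuming a hypothetical helper qr_reduce and the block bookkeeping described above:

```python
import numpy as np

def qr_reduce(y, Z_blocks):
    """QR pre-processing of Section 3.3.1 (sketch): reduce the problem from
    n to c = sum_i c_i dimensions.  Z_blocks = [Z_1, ..., Z_m]."""
    Z = np.hstack(Z_blocks)
    n, c = Z.shape
    Q, R_full = np.linalg.qr(Z, mode="complete")     # Q: n x n,  R_full: n x c
    R = R_full[:c, :]                                # c x c upper-triangular block
    y_tilde = Q.T @ y
    y_c, y_rest = y_tilde[:c], y_tilde[c:]           # y_rest enters only the alpha term
    # Reduced variance matrices: V~_0 = I_c and V~_i = R_i R_i^T.
    cols = np.cumsum([0] + [Zi.shape[1] for Zi in Z_blocks])
    V_tilde = [np.eye(c)]
    for i in range(len(Z_blocks)):
        R_i = R[:, cols[i]:cols[i + 1]]
        V_tilde.append(R_i @ R_i.T)
    return y_c, y_rest, V_tilde
```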

3.3.2. Woodbury Matrix Identity

Alternatively, if $c$ is large (say $c > n$) but individually $c_i < n$ for $i = 1, \dots, m$, we can use the Woodbury identity to reduce each linear system to $c_k$ dimensions (instead of $n$ dimensions) when updating the component $\gamma_k$. Suppose we are in cycle $t+1$ of either PX-CD or PXI-CD and we wish to update the parameter $\gamma_k$ with $k \ne 0$; then we can simplify the optimization
$$\gamma_k^{(t+1)} = \operatorname*{arg\,min}_{x \ge 0}\, h_k(x), \qquad h_k(x) = H(\Omega_{-k} + x V_k, C),$$
by viewing $\Omega_{-k} + x V_k = \Omega + (x - \gamma_k^{(t)}) Z_k Z_k^\top$ as a low-rank perturbation of the matrix $\Omega$. The Woodbury identity gives the expression for the inverse,
$$\left[\Omega + (x - \gamma_k^{(t)}) Z_k Z_k^\top\right]^{-1} = \Omega^{-1} - (x - \gamma_k^{(t)})\, \Omega^{-1} Z_k \left[I_{c_k} + (x - \gamma_k^{(t)}) Z_k^\top \Omega^{-1} Z_k\right]^{-1} Z_k^\top \Omega^{-1}, \qquad (20)$$
which involves the unperturbed inverse $\Omega^{-1}$ and the inverse of a smaller $c_k \times c_k$ matrix. In this implementation of PXI-CD and PX-CD, we re-compute and store the matrix $\Omega^{-1}$ after each coordinate update. Let $w := Z_k^\top \Omega^{-1} y$; then the line search along the $k$-th component ($k \ne 0$) of the function $H$ simplifies to
$$h_k(x) = -(x - \gamma_k^{(t)})\, w^\top \left[I_{c_k} + (x - \gamma_k^{(t)}) Z_k^\top \Omega^{-1} Z_k\right]^{-1} w + x \operatorname{tr}\!\left(C^{-1} Z_k Z_k^\top\right) + \text{const}.$$
When implementing PXI-CD, there is no additional cost in using the Woodbury identity, as the update $C \leftarrow \Omega$ is made after every coordinate update and the trace term in the line search can be evaluated cheaply because $\Omega^{-1}$ is known. We can now implement Newton's method to find the minimum of $h_k(x)$. Let
$$B := Z_k^\top \Omega^{-1} Z_k, \qquad M := I_{c_k} + (x - \gamma_k^{(t)})\, B.$$
If we then solve the $c_k$-dimensional linear systems $M d = w$ and $M f = B d$ with CG and store the solutions in the vectors $d$ and $f$, respectively, we can evaluate the first and second derivatives of $h_k(x)$ as
$$h_k'(x) = -w^\top d + (x - \gamma_k^{(t)})\, d^\top B d + \operatorname{tr}(B), \qquad h_k''(x) = 2\, d^\top B d - 2 (x - \gamma_k^{(t)})\, d^\top B f,$$
and implement the Newton steps (14). The derivation of these derivative expressions is provided in Appendix B. After each coordinate $k$ is updated, we evaluate and store the updated inverse covariance matrix,
$$\Omega^{-1} \leftarrow \left[\Omega + (\gamma_k^{(t+1)} - \gamma_k^{(t)})\, Z_k Z_k^\top\right]^{-1},$$
using (20), which requires inverting only a smaller $c_k \times c_k$ matrix. When $k = 0$, no reduction in complexity is possible, as the perturbation to $\Omega$ is full-rank, and we update $\gamma_0$ as in Section 3.1. If $c = \sum_{i=1}^m c_i < n$, we can use an alternative form of the Woodbury identity,
$$\Omega^{-1} = \gamma_0^{-1} \left[ I - W \left(\gamma_0 I + W^\top W\right)^{-1} W^\top \right], \qquad W := Z \Sigma^{1/2},$$
to update $\Omega^{-1}$ after $\gamma_0$ has been updated. If $c > n$, we invert the full $n \times n$ updated covariance matrix to obtain $\Omega^{-1}$ using a Cholesky factorization at a cost of $O(n^3)$. This $O(n^3)$ cost for updating $\gamma_0$ is a disadvantage of this implementation when $c > n$; however, numerical simulations suggest that, when $c_i \ll n$ for $i = 1, \dots, m$, the Woodbury implementation is the fastest.
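The bookkeeping for the stored inverse can be sketched in a few lines; this is an illustration of identity (20) with an assumed helper name, not the authors' implementation:

```python
import numpy as np

def woodbury_update_inverse(Omega_inv, Z_k, delta):
    """Return [Omega + delta * Z_k Z_k^T]^{-1} from the stored Omega^{-1} using
    the Woodbury identity (20); only a c_k x c_k matrix is inverted."""
    c_k = Z_k.shape[1]
    OZ = Omega_inv @ Z_k                              # n x c_k
    M = np.eye(c_k) + delta * (Z_k.T @ OZ)            # I_{c_k} + delta * Z_k^T Omega^{-1} Z_k
    return Omega_inv - delta * OZ @ np.linalg.solve(M, OZ.T)
```

Here delta plays the role of $\gamma_k^{(t+1)} - \gamma_k^{(t)}$ after the coordinate update.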

3.4. Variable Selection

When the number of variance components is large, performing variable selection can enhance model interpretation and provide more stable parameter estimation. To impose sparsity when estimating $\gamma$, a lasso or ridge penalty can be added to the negative log-likelihood [20]. The MM implementation [9] provides modifications to the MM algorithm such that both lasso- and ridge-penalized expressions can be minimized. We now show that, with PX-CD, we can minimize the penalized negative log-likelihood under the $\ell_1$ penalty. Consider the penalized negative log-likelihood expression
$$G(\Omega) + \lambda \sum_{i=0}^m \gamma_i, \qquad \gamma_i \ge 0, \quad \lambda > 0.$$
We then have the surrogate function
$$J(\Omega, C) := H(\Omega, C) + \lambda \sum_{i=0}^m \gamma_i.$$
If we use PX-CD to minimize $J$, we need to repeatedly minimize the one-dimensional function along each coordinate, $j_k(x) := h_k(x) + \lambda x + \lambda \sum_{i \ne k} \gamma_i$. Here, we implement Newton's method as before, the only difference being that the first derivative is increased by the constant $\lambda$,
$$j_k'(x) := h_k'(x) + \lambda.$$
It follows from Lemma 1 that $j_k(x)$ is either strictly convex or linear with a strictly positive gradient for $x \ge 0$. We check whether $h_k'(0) + \lambda \ge 0$ to determine if $j_k(0)$ is the global minimum for $x \ge 0$; if it is, we set $\gamma_k^{(t+1)} = 0$ in cycle $t+1$. If $j_k'(0) < 0$, we initiate Newton's method at the current value of the variance component, $x_0 = \gamma_k^{(t)}$. The larger the parameter $\lambda$, the more often this active-constraint condition is met, and therefore the more variance components are set to zero.
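The resulting coordinate update differs from the unpenalized one only in the thresholding check, as the following sketch shows; h_prime_at_zero and newton_solver are hypothetical stand-ins for the quantities and routine of Section 3.1:

```python
def penalized_coordinate_update(h_prime_at_zero, newton_solver, gamma_k, lam):
    """Lasso-penalized PX-CD update of Section 3.4 (sketch).  h_prime_at_zero is
    h_k'(0); newton_solver(x0, shift) is assumed to run the Newton iteration of
    Section 3.1 on the shifted derivative h_k'(x) + shift."""
    if h_prime_at_zero + lam >= 0.0:   # j_k'(0) >= 0: gamma_k is set to zero
        return 0.0
    return newton_solver(x0=gamma_k, shift=lam)
```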

4. Numerical Results

In this section, we assess the efficiency of PX-CD and PXI-CD via simulation and compare them against the best current alternative method, the MM algorithm [9]. In [9], the MM algorithm is found to be superior to both the EM algorithm and the Fisher scoring method in terms of performance. This superior performance was also described in [21].
In our experiments, we additionally include the expectation-maximization (EM) and Fisher scoring (FS) methods, but only for the small-scale problem where $\gamma_0 = 0.1$. We exclude the EM and FS methods from the more difficult problems, as they are too slow and unsuitable. The MM, EM and FS algorithms are executed with the Julia implementation of [9]. We provide results in three settings. First, we simulate data from the model of Section 3.3 with $c_i < n$, i.e., the matrices $V_i$ are low-rank. Second, we simulate with $c_i = n$, i.e., the matrices $V_i$ are full-rank. Finally, we simulate data from the model of Section 3.3, where the matrices $V_i$ are constructed from a real data set containing genetic variants of mice.

4.1. Simulations

For the following simulations, we simulate data from the model of Section 3.3. Since the fixed effects $\beta$ can always be eliminated from the model using REML, we focus solely on the estimation of the variance-component parameters; in other words, the value of $\beta$ in our simulations is irrelevant. In each simulation, we generate the fixed matrices $V_i$ as
$$V_i = \frac{\sum_{j=1}^{r} Z_{i,j} Z_{i,j}^\top}{\left\| \sum_{j=1}^{r} Z_{i,j} Z_{i,j}^\top \right\|_F},$$
where $Z_{i,j} \sim N(0, I_n)$ and $\|\cdot\|_F$ is the Frobenius matrix norm. The rank of each $V_i$ equals the parameter $r$, which we vary.
In each simulation, for $k \ne 0$, we draw the $m$ true variance components as $\gamma_k = (1 + \rho)^2$, where $\rho \sim N(0, 1)$. We then simulate the response from $y \sim N(0, \sum_{i=0}^m \gamma_i V_i)$ and estimate the vector $\hat\gamma$. We vary the value of $\gamma_0$ over the set $\{0.1, 1, 10\}$ and keep $n = 1000$.
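For concreteness, the simulation design can be sketched as follows (our own illustration; the seed and helper name are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=1000, m=20, r=20, gamma0=0.1):
    """Generate the design of Section 4.1: V_i = sum_j Z_{i,j} Z_{i,j}^T,
    Frobenius-normalized, with gamma_k = (1 + rho)^2 for k != 0."""
    Vs = [np.eye(n)]                                  # V_0 = I_n
    for _ in range(m):
        Z = rng.standard_normal((n, r))
        V = Z @ Z.T
        Vs.append(V / np.linalg.norm(V, "fro"))
    gamma = np.concatenate(([gamma0], (1.0 + rng.standard_normal(m)) ** 2))
    Omega = sum(g * V for g, V in zip(gamma, Vs))
    y = rng.multivariate_normal(np.zeros(n), Omega)
    return y, Vs, gamma
```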

4.1.1. Low-Rank

We now present the results for the case where the $V_i$ are generated as stated above with $r < n$. As the $V_i$ are low-rank, we run the Woodbury implementation of PXI-CD and exclude PX-CD, since it has the same computational cost per update as PXI-CD and exhibits almost identical performance. First, PXI-CD is run until the relative change $|L(\gamma^{(t+1)}) - L(\gamma^{(t)})| / (|L(\gamma^{(t)})| + 1)$ is less than $10^{-10}$, and we store the final objective value as $L^*$. We then run the MM, EM and FS algorithms and terminate each once $L(\gamma^{(t)}) > L^*$. We initialize all algorithms from the point $\gamma^{(0)} = 1$. Each simulation scenario is replicated 10 times. The mean running time is reported, with the standard error of the mean running time in parentheses.
The results of the low-rank simulations are given in Table 1, Table 2 and Table 3. They indicate that, apart from the smallest-scale problem with $r = m = 20$, our PXI-CD algorithm outperforms the MM algorithm, and increasingly so as the scale of the model (both $m$ and $r$) increases.

4.1.2. Full-Rank

We now present the results when $\operatorname{rank}(V_i) = n = 1000$. We implement the standard PX-CD, PXI-CD and MM algorithms, since neither the Woodbury nor the QR method can be used here. Initially, PX-CD is run until the relative change $|L(\gamma^{(t+1)}) - L(\gamma^{(t)})| / (|L(\gamma^{(t)})| + 1)$ is less than $10^{-10}$, and we store the final objective value as $L^*$. The other algorithms terminate once $L(\gamma^{(t+1)}) > L^*$. For the following simulations, we consider one iteration of a CD algorithm to be a single cycle of updates. Each simulation scenario is replicated 10 times. The mean running time and mean iteration count are reported, with the standard errors of the mean running time and mean iteration count in parentheses.
The results of the full-rank simulations are given in Table 4, Table 5 and Table 6. PX-CD and PXI-CD both significantly outperform the MM and basic CD algorithms in these examples. We observe that, as the number of components $m$ increases, the problem becomes increasingly difficult for the MM algorithm. An intuitive explanation for this performance gap is that the CD algorithms are able to identify the active constraint set $\{k : \hat\gamma_k = 0\}$ in only a few cycles.
When $\gamma_0 = 0.1$ or $\gamma_0 = 1$ and $m$ is large ($m = 50$, $m = 100$), PXI-CD is the fastest algorithm, even though it is computationally the most expensive per cycle. When $\gamma_0 = 10$ and $m = 100$, PXI-CD is again the fastest algorithm. In fact, as the problem size grows, the number of iterations PXI-CD requires to converge becomes smaller than that of basic CD. These simulations indicate that PXI-CD is well suited to problems with large $m$ and $n$ in which the $V_i$ are full-rank. The basic CD algorithm, while numerically inferior to the PX-CD and PXI-CD algorithms, still outperforms the MM algorithm in these simulations.

4.2. Genetic Data

We now present simulation results in which the $Z_i$ are constructed from the mouse single nucleotide polymorphism (SNP) array data set (https://openmendel.github.io/SnpArrays.jl/latest/#Example-data, accessed on 1 July 2022) available from the OpenMendel project [21]. The data set consists of $Z$, an $n \times c$ matrix containing $c$ genetic variants for $n$ individual mice. For this experiment, $c = 10,200$ and $n = 500$. We artificially generate $m$ different genetic regions by partitioning the columns of $Z$ into gene matrices $Z_i$, $i = 1, \dots, m$, where $Z_i \in \mathbb{R}^{n \times r}$. We then compose the fixed matrices $V_1, \dots, V_m$ as
$$V_i = \frac{Z_i Z_i^\top}{\left\| \sum_{j=1}^m Z_j Z_j^\top \right\|_F}.$$
We simulate $\gamma$ and $y$ as in Section 4.1. In this case, $y$ mimics a vector of quantitative trait measurements for the $n$ mice. This data set is well suited to testing our method when $m$ is large ($m > n$). In these cases, we observe that, when initialized at the same point, the MM and PXI-CD methods may converge to different stationary points. Therefore, we run all algorithms until the relative change $|L(\gamma^{(t+1)}) - L(\gamma^{(t)})| / (|L(\gamma^{(t)})| + 1)$ is less than $10^{-10}$. Since $r < n$, we implement PXI-CD using the Woodbury identity. Each simulation scenario is replicated 10 times. The mean running time and mean iteration count are reported, with the standard errors in parentheses.
The results of the genetic-data simulation are provided in Table 7. We observe that PXI-CD outperforms the MM algorithm for all values of $m$ and $r$ on this data set, both in the number of iterations and in the running time until convergence. When $m > n$, we observe that the MM and PXI-CD methods converge to noticeably different objective values. We suspect that this is because, when $m > n$, the likelihood in (2) exhibits many more local minima. On average, PXI-CD converges to a better stationary point when $m$ is large and $m > n$.

5. Conclusions

The MLE solution for variance-component models requires the optimization of a non-convex log-likelihood function. In this paper, we showed that a basic implementation of the cyclic CD algorithm is computationally expensive to run and is not amenable to traditional convergence analysis.
To remedy this, we proposed a novel parameter-expanded CD (PX-CD) algorithm, which is both computationally faster and also subject to theoretical guarantees. PX-CD optimizes a higher-dimensional surrogate function that attains a coordinate-wise minimum with respect to each of the variance component parameters. The extra speed is derived from the fact that required quantities (such as first and second-order derivatives) are evaluated via the conjugate-gradient algorithm.
Additionally, we proposed an alternative updating regime called PXI-CD, where the expanded block parameters are updated immediately after each coordinate update. This updating regime requires more computation per cycle than PX-CD. However, we observed numerically that, for large-scale models, where the number of variance components $m+1$ is large and the $V_i$ are full-rank, the reduction in the number of iterations needed to converge greatly offsets the additional computational cost per coordinate-update cycle.
Our numerical experiments suggest that PX-CD and PXI-CD outperform the best current alternative—the MM algorithm. When the number of variance components m is large, we observed that PXI-CD was significantly faster than the MM algorithm and tended to converge to more optimal stationary points.
A potential extension of this work is to apply parameter-expanded CD algorithms to the multivariate-response variance-component model. Instead of a univariate response, one considers a multivariate-response model with an $n \times d$ response matrix $Y$. In this setup, $\mathbb{E}\,Y = XB$, where $B$ is a $p \times d$ matrix. The $nd \times nd$ covariance matrix is of the form
$$\Omega = \operatorname{cov}\!\left(\operatorname{vec}(Y)\right) = \sum_{i=0}^m \Gamma_i \otimes V_i,$$
where the $\Gamma_i$ are unknown $d \times d$ variance components and the $V_i$ are known $n \times n$ covariance matrices. The challenging aspect of this problem is that the optimization with respect to each parameter $\Gamma_i$ is not univariate but rather a search over positive semi-definite matrices, which is itself a difficult optimization problem.

Author Contributions

Conceptualization, A.M., S.M. and Z.B.; methodology, A.M., S.M. and Z.B.; software, A.M.; formal analysis, A.M., S.M. and Z.B.; data curation, A.M.; writing—original draft preparation, A.M.; writing—review and editing, A.M., S.M. and Z.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data (the mouse single nucleotide polymorphism (SNP) array data set) can be found at https://openmendel.github.io/SnpArrays.jl/latest (accessed on 1 July 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CD: Coordinate descent
PX-CD: Parameter-expanded coordinate descent
PXI-CD: Parameter expanded-immediate coordinate descent

Appendix A. QR Method

In this section, we provide further details on the QR factorization used in Section 3.3.1. Recall that we have the decomposition
$$Z = \underbrace{\begin{bmatrix} Q_{[c]} & Q_{[n-c]} \end{bmatrix}}_{Q} \begin{bmatrix} R \\ 0 \end{bmatrix}, \qquad R = \begin{bmatrix} R_1 & R_2 & \cdots & R_m \end{bmatrix},$$
where $Q$ is an orthogonal matrix and $R$ is partitioned such that the number of columns in $R_i$ equals the number of columns in $Z_i$. Recall also that $\tilde{y} = [\tilde{y}_{[c]}, \tilde{y}_{[n-c]}] = Q^\top y$, where $\tilde{y}_{[c]}$ contains the first $c$ elements of $\tilde{y}$ and $\tilde{y}_{[n-c]}$ the last $n - c$ elements. Then,
$$\Omega = \gamma_0 I + Z \Sigma Z^\top = Q \left( \gamma_0 I + \begin{bmatrix} R \\ 0 \end{bmatrix} \Sigma \begin{bmatrix} R \\ 0 \end{bmatrix}^{\!\top} \right) Q^\top = Q \begin{bmatrix} R \Sigma R^\top + \gamma_0 I_c & 0 \\ 0 & \gamma_0 I_{n-c} \end{bmatrix} Q^\top.$$
Taking the inverse of this matrix yields
$$\Omega^{-1} = Q \begin{bmatrix} \left(R \Sigma R^\top + \gamma_0 I_c\right)^{-1} & 0 \\ 0 & \gamma_0^{-1} I_{n-c} \end{bmatrix} Q^\top,$$
and therefore
$$y^\top \Omega^{-1} y = \tilde{y}_{[c]}^\top \tilde{\Omega}^{-1} \tilde{y}_{[c]} + \gamma_0^{-1}\, \tilde{y}_{[n-c]}^\top \tilde{y}_{[n-c]}. \qquad \text{(A1)}$$
We now consider the simplification of the trace terms in $H$. Let
$$\tilde\Sigma = \operatorname{blockdiag}(\tilde\gamma_1 I_{c_1}, \dots, \tilde\gamma_m I_{c_m}).$$
Then $C = \tilde\gamma_0 I + Z \tilde\Sigma Z^\top$ and
$$C^{-1} = Q \begin{bmatrix} \left(R \tilde\Sigma R^\top + \tilde\gamma_0 I_c\right)^{-1} & 0 \\ 0 & \tilde\gamma_0^{-1} I_{n-c} \end{bmatrix} Q^\top.$$
If we substitute this expression into the trace term for $k \ne 0$ (noting that $Q^\top Z_k = [R_k^\top \ 0]^\top$), we obtain
$$\operatorname{tr}\!\left(C^{-1} V_k\right) = \operatorname{tr}\!\left(Z_k^\top C^{-1} Z_k\right) = \operatorname{tr}\!\left( \begin{bmatrix} R_k \\ 0 \end{bmatrix}^{\!\top} \begin{bmatrix} \left(R \tilde\Sigma R^\top + \tilde\gamma_0 I_c\right)^{-1} & 0 \\ 0 & \tilde\gamma_0^{-1} I_{n-c} \end{bmatrix} \begin{bmatrix} R_k \\ 0 \end{bmatrix} \right) = \operatorname{tr}\!\left[ \left(R \tilde\Sigma R^\top + \tilde\gamma_0 I_c\right)^{-1} R_k R_k^\top \right]. \qquad \text{(A2)}$$
When $k = 0$, we have
$$\operatorname{tr}\!\left(C^{-1} V_0\right) = \operatorname{tr}\!\left( Q \begin{bmatrix} \left(R \tilde\Sigma R^\top + \tilde\gamma_0 I_c\right)^{-1} & 0 \\ 0 & \tilde\gamma_0^{-1} I_{n-c} \end{bmatrix} Q^\top \right) = \operatorname{tr}\!\left[ \left(R \tilde\Sigma R^\top + \tilde\gamma_0 I_c\right)^{-1} \right] + \tilde\gamma_0^{-1} (n - c). \qquad \text{(A3)}$$
Recalling the definitions $\tilde{V}_i := R_i R_i^\top$, $\tilde{V}_0 := I_c$, $\tilde\Omega := \sum_{i=0}^m \gamma_i \tilde{V}_i$ and $\tilde{C} := \sum_{i=0}^m \tilde\gamma_i \tilde{V}_i$, and combining Equations (A1)-(A3), we obtain
$$H(\Omega, C) = y^\top \Omega^{-1} y + \sum_{i=0}^m \gamma_i \operatorname{tr}\!\left(C^{-1} V_i\right) + \ln\det(C) - n = \tilde{y}_{[c]}^\top \tilde\Omega^{-1} \tilde{y}_{[c]} + \sum_{i=0}^m \gamma_i \operatorname{tr}\!\left(\tilde{C}^{-1} \tilde{V}_i\right) + \ln\det(C) - n + \gamma_0 \tilde\gamma_0^{-1} (n - c) + \gamma_0^{-1}\, \tilde{y}_{[n-c]}^\top \tilde{y}_{[n-c]}.$$

Appendix B. Woodbury Identity

Recall from Section 3.3.2 the definitions $w := Z_k^\top \Omega^{-1} y$, $B := Z_k^\top \Omega^{-1} Z_k$ and $M := I_{c_k} + (x - \gamma_k^{(t)}) B$, and the simplified univariate function
$$h_k(x) = -(x - \gamma_k^{(t)})\, w^\top M^{-1} w + x \operatorname{tr}(B) + \text{const}.$$
We now derive the first and second derivatives of this function. Differentiation of an invertible symmetric matrix gives
$$\frac{\partial M^{-1}}{\partial x} = -M^{-1} B M^{-1}. \qquad \text{(A4)}$$
Then, from the product rule of differentiation,
$$h_k'(x) = (x - \gamma_k^{(t)})\, w^\top M^{-1} B M^{-1} w - w^\top M^{-1} w + \operatorname{tr}(B).$$
If we then approximately solve the linear system $M d = w$ with CG, then
$$h_k'(x) = -w^\top d + (x - \gamma_k^{(t)})\, d^\top B d + \operatorname{tr}(B).$$
Using the matrix product rule of differentiation, we have that
$$\frac{\partial\, (M^{-1} B M^{-1})}{\partial x} = -2\, M^{-1} B M^{-1} B M^{-1}. \qquad \text{(A5)}$$
Then, using (A4) and (A5) to differentiate $h_k'(x)$, we obtain
$$h_k''(x) = 2\, w^\top M^{-1} B M^{-1} w - 2 (x - \gamma_k^{(t)})\, w^\top M^{-1} B M^{-1} B M^{-1} w.$$
If we solve $M d = w$ and $M j = B d$ with CG, then $j$ approximates the matrix-vector product $M^{-1} B M^{-1} w$, and the second derivative can be evaluated as
$$h_k''(x) = 2\, d^\top B d - 2 (x - \gamma_k^{(t)})\, d^\top B j.$$

References

1. Kang, H.M.; Sul, J.H.; Service, S.K.; Zaitlen, N.A.; Kong, S.Y.; Freimer, N.B.; Sabatti, C.; Eskin, E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010, 42, 348–354.
2. Searle, S.; Casella, G.; McCulloch, C. Variance Components; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2009.
3. Jiang, J.; Nguyen, T. Linear and Generalized Linear Mixed Models and Their Applications; Springer: New York, NY, USA, 2007; Volume 1.
4. Harville, D.A. Maximum likelihood approaches to variance component estimation and to related problems. J. Am. Stat. Assoc. 1977, 72, 320–338.
5. Jennrich, R.I.; Sampson, P. Newton–Raphson and related algorithms for maximum likelihood variance component estimation. Technometrics 1976, 18, 11–17.
6. Longford, N.T. A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects. Biometrika 1987, 74, 817–827.
7. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22.
8. Lindstrom, M.J.; Bates, D.M. Newton–Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. J. Am. Stat. Assoc. 1988, 83, 1014–1022.
9. Zhou, H.; Hu, L.; Zhou, J.; Lange, K. MM algorithms for variance components models. J. Comput. Graph. Stat. 2019, 28, 350–361.
10. Wright, S.J. Coordinate-descent algorithms. Math. Program. 2015, 151, 3–34.
11. Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 2012, 22, 341–362.
12. Luo, Z.Q.; Tseng, P. On the convergence of the coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl. 1992, 72, 7–35.
13. Golub, G.H.; Van Loan, C.F. Matrix Computations; JHU Press: Baltimore, MD, USA, 2013.
14. Liu, J.S.; Wu, Y.N. Parameter expansion for data augmentation. J. Am. Stat. Assoc. 1999, 94, 1264–1274.
15. Meng, X.L.; Van Dyk, D.A. Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika 1999, 86, 301–320.
16. Bezdek, J.C.; Hathaway, R.J. Convergence of alternating optimization. Neural Parallel Sci. Comput. 2003, 11, 351–368.
17. Bezdek, J.; Hathaway, R. Some notes on alternating optimization. In Advances in Soft Computing—AFSS 2002, Proceedings of the AFSS International Conference on Fuzzy Systems, Calcutta, India, 3–6 February 2002; Springer: Berlin/Heidelberg, Germany, 2002; Volume 2275, pp. 288–300.
18. Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 2001, 109, 475–494.
19. Hartley, H.O.; Rao, J.N. Maximum-likelihood estimation for the mixed analysis of variance model. Biometrika 1967, 54, 93–108.
20. Schelldorfer, J.; Bühlmann, P.; van de Geer, S. Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scand. J. Stat. 2011, 38, 197–214.
21. Zhou, H.; Sinsheimer, J.S.; Bates, D.M.; Chu, B.B.; German, C.A.; Ji, S.S.; Keys, K.L.; Kim, J.; Ko, S.; Mosher, G.D.; et al. OpenMendel: A cooperative programming project for statistical genetics. Hum. Genet. 2020, 139, 61–71.
Figure 1. Two local minima, $(0.11, 10.26)$ and $(14.35, 8.66)$, obtained for $g_k(x)$ when $\alpha_{k,1} = 1.2$, $\alpha_{k,2} = 3$, $d_{k,1} = 10$ and $d_{k,2} = 0.2$.
Table 1. The running times (s) when $V_i$ are low-rank, $n = 1000$, and $\gamma_0 = 0.1$.

| Method | r | m = 20 | m = 50 | m = 75 | m = 100 |
|---|---|---|---|---|---|
| PXI-CD | 20 | 4.20 (0.07) | 17.00 (1.18) | 25.81 (1.57) | 51.8 (5.09) |
| MM | 20 | 1.89 (0.22) | 42.54 (7.69) | 220.06 (61.19) | 845.2 (136.89) |
| EM | 20 | 23.30 (0.52) | - | - | - |
| FS | 20 | 91.80 (0.99) | - | - | - |
| PXI-CD | 50 | 5.42 (0.16) | 21.36 (0.85) | 37.46 (2.72) | 40.21 (2.97) |
| MM | 50 | 95.63 (41.20) | 203.24 (53.20) | 584.76 (53.20) | 1167.22 (114.38) |
| PXI-CD | 100 | 8.96 (0.25) | 30.17 (2.24) | 28.26 (0.73) | 45.65 (2.56) |
| MM | 100 | 112.58 (21.48) | 555.38 (82.01) | 699.89 (90.55) | 1327.54 (53.80) |
| PXI-CD | 150 | 14.13 (0.63) | 33.18 (4.67) | 38.24 (1.54) | 48.97 (2.08) |
| MM | 150 | 122.44 (38.48) | 628.24 (55.61) | 929.67 (75.87) | 1243.97 (84.30) |
Table 2. The running times (s) when $V_i$ are low-rank, $n = 1000$, and $\gamma_0 = 1$.

| Method | r | m = 20 | m = 50 | m = 75 | m = 100 |
|---|---|---|---|---|---|
| PXI-CD | 20 | 3.10 (0.08) | 9.52 (0.28) | 16.30 (0.43) | 26.05 (0.77) |
| MM | 20 | 3.11 (0.84) | 117.40 (24.42) | 280.31 (51.43) | 719.32 (158.51) |
| PXI-CD | 50 | 4.08 (0.11) | 14.35 (0.76) | 29.94 (1.33) | 53.10 (3.7) |
| MM | 50 | 157.98 (34.53) | 501.60 (90.15) | 546.56 (91.48) | 1133.83 (129.96) |
| PXI-CD | 100 | 6.00 (0.36) | 25.34 (0.95) | 61.69 (3.38) | 73.23 (10.6) |
| MM | 100 | 103.08 (31.4) | 544.06 (61.35) | 743.79 (104.67) | 1254.97 (87.32) |
| PXI-CD | 150 | 9.18 (0.45) | 42.30 (2.08) | 66.13 (8.3) | 66.67 (10.2) |
| MM | 150 | 176.80 (31.64) | 498.12 (63.39) | 986.87 (61.46) | 1110.90 (105.59) |
Table 3. The running times (s) when $V_i$ are low-rank, $n = 1000$, and $\gamma_0 = 10$.

| Method | r | m = 20 | m = 50 | m = 75 | m = 100 |
|---|---|---|---|---|---|
| PXI-CD | 20 | 3.62 (0.10) | 8.79 (0.20) | 15.60 (0.41) | 23.13 (0.91) |
| MM | 20 | 10.17 (0.84) | 318.53 (90.49) | 648.64 (132.62) | 839.24 (158.52) |
| PXI-CD | 50 | 4.36 (0.09) | 14.39 (0.4) | 24.65 (0.97) | 42.07 (1.13) |
| MM | 50 | 184.03 (38.8) | 473.58 (80.82) | 648.33 (114.98) | 1230.56 (128.79) |
| PXI-CD | 100 | 6.53 (0.22) | 26.07 (1.51) | 48.48 (2.48) | 60.72 (5.33) |
| MM | 100 | 124.68 (19.93) | 511.66 (65.85) | 880.53 (69.53) | 1279.40 (81.93) |
| PXI-CD | 150 | 10.15 (0.37) | 35.34 (1.72) | 66.94 (2.28) | 103.89 (10.36) |
| MM | 150 | 199.80 (33.17) | 512.35 (49.97) | 943.84 (84.24) | 1244.32 (72.13) |
Table 4. The convergence results when $V_i$ are full-rank, $n = 1000$, and $\gamma_0 = 0.1$.

| Method | m | Iterations | Time (s) | Objective |
|---|---|---|---|---|
| PX-CD | 25 | 89.00 (2.90) | 37.47 (1.71) | −1383.44 |
| PXI-CD | 25 | 146.90 (21.17) | 97.49 (13.58) | −1383.44 |
| CD | 25 | 109.40 (17.52) | 286.96 (19.83) | −1383.44 |
| MM | 25 | 3957.90 (300.77) | 281.53 (21.03) | −1383.44 |
| PX-CD | 50 | 182.40 (11.34) | 147.21 (10.64) | −1831.40 |
| PXI-CD | 50 | 73.10 (2.97) | 104.42 (4.64) | −1831.40 |
| CD | 50 | 103.40 (6.18) | 852.41 (70.22) | −1831.40 |
| MM | 50 | 10,240.40 (2140.25) | 1557.79 (343.47) | −1831.40 |
| PX-CD | 100 | 279.10 (15.09) | 376.18 (24.14) | −2143.93 |
| PXI-CD | 100 | 80.70 (2.97) | 211.19 (8.58) | −2143.93 |
| CD | 100 | 164.00 (8.84) | 2060.69 (155.37) | −2143.93 |
| MM | 100 | 12,171.90 (1526.30) | 3482.30 (465.51) | −2143.93 |
Table 5. The convergence results when $V_i$ are full-rank, $n = 1000$, and $\gamma_0 = 1$.

| Method | m | Iterations | Time (s) | Objective |
|---|---|---|---|---|
| PX-CD | 25 | 82.60 (4.05) | 32.00 (1.93) | −1707.53 |
| PXI-CD | 25 | 172.70 (7.08) | 112.31 (5.15) | −1707.53 |
| CD | 25 | 116.90 (6.31) | 279.61 (18.03) | −1707.53 |
| MM | 25 | 4313.90 (347.85) | 303.13 (24.3) | −1707.53 |
| PX-CD | 50 | 192.50 (8.89) | 155.03 (8.11) | −1957.26 |
| PXI-CD | 50 | 103.00 (18.85) | 147.79 (26.11) | −1957.26 |
| CD | 50 | 110.20 (4.53) | 940.96 (58.0) | −1957.26 |
| MM | 50 | 15,860.80 (2872.82) | 2488.70 (484.24) | −1957.26 |
| PX-CD | 100 | 313.60 (13.88) | 423.78 (20.48) | −2203.61 |
| PXI-CD | 100 | 86.80 (3.06) | 226.10 (8.51) | −2203.61 |
| CD | 100 | 185.60 (8.33) | 2422.93 (143.76) | −2203.61 |
| MM | 100 | 12,820.60 (2102.69) | 3792.45 (725.63) | −2203.61 |
Table 6. The convergence results when $V_i$ are full-rank, $n = 1000$, and $\gamma_0 = 10$.

| Method | m | Iterations | Time (s) | Objective |
|---|---|---|---|---|
| PX-CD | 25 | 75.20 (2.88) | 25.74 (1.46) | −2603.51 |
| PXI-CD | 25 | 152.00 (6.98) | 95.59 (5.12) | −2603.51 |
| CD | 25 | 38.50 (1.60) | 172.03 (11.70) | −2603.51 |
| MM | 25 | 3616.90 (513.57) | 254.87 (36.27) | −2603.51 |
| PX-CD | 50 | 143.70 (6.93) | 108.55 (5.89) | −2668.99 |
| PXI-CD | 50 | 177.50 (28.72) | 249.73 (39.38) | −2668.99 |
| CD | 50 | 79.80 (4.59) | 668.65 (40.71) | −2668.99 |
| MM | 50 | 11,306.30 (1824.79) | 1697.89 (275.36) | −2668.99 |
| PX-CD | 100 | 304.00 (12.39) | 412.54 (19.01) | −2731.09 |
| PXI-CD | 100 | 87.40 (2.54) | 230.54 (7.27) | −2731.09 |
| CD | 100 | 181.80 (7.63) | 2697.20 (150.64) | −2731.09 |
| MM | 100 | 11,796.10 (1732.0) | 3358.14 (521.92) | −2731.09 |
Table 7. The running times (s) for the mouse data.

| Method | m | r | Iterations | Time (s) | Objective |
|---|---|---|---|---|---|
| PXI-CD | 100 | 102 | 95.1 (9.9) | 23.1 (2.66) | −43.6 |
| MM | 100 | 102 | 1580.2 (194.76) | 108.7 (15.37) | −43.6 |
| PXI-CD | 200 | 51 | 180.0 (13.41) | 58.8 (4.4) | −82.5 |
| MM | 200 | 51 | 2797.8 (401.1) | 466.0 (88.95) | −82.6 |
| PXI-CD | 500 | 20 | 397.8 (53.73) | 258.1 (34.56) | −100.3 |
| MM | 500 | 20 | 2700.8 (366.57) | 1055.4 (179.97) | −106.6 |
| PXI-CD | 1000 | 10 | 434.7 (35.7) | 507.5 (39.91) | −82.1 |
| MM | 1000 | 10 | 3004.8 (343.61) | 2329.4 (296.37) | −91.2 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

