Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents

Ke, Zheng Tracy; Wang, Jingming

doi:10.3390/math12111682

Open AccessFeature PaperArticle

Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents

by

Zheng Tracy Ke

^*

and

Jingming Wang

Department of Statistics, Harvard University, Cambridge, MA 02138, USA

^*

Author to whom correspondence should be addressed.

Mathematics 2024, 12(11), 1682; https://doi.org/10.3390/math12111682

Submission received: 18 April 2024 / Revised: 11 May 2024 / Accepted: 17 May 2024 / Published: 28 May 2024

(This article belongs to the Special Issue Theory and Applications of Random Matrix)

Download

Browse Figure

Versions Notes

Abstract

:

Topic modeling is a widely utilized tool in text analysis. We investigate the optimal rate for estimating a topic model. Specifically, we consider a scenario with n documents, a vocabulary of size p, and document lengths at the order N. When

N \geq c \cdot p

, referred to as the long-document case, the optimal rate is established in the literature at

\sqrt{p / (N n)}

. However, when

N = o (p)

, referred to as the short-document case, the optimal rate remains unknown. In this paper, we first provide new entry-wise large-deviation bounds for the empirical singular vectors of a topic model. We then apply these bounds to improve the error rate of a spectral algorithm, Topic-SCORE. Finally, by comparing the improved error rate with the minimax lower bound, we conclude that the optimal rate is still

\sqrt{p / (N n)}

in the short-document case.

Keywords:

decoupling inequality; entry-wise eigenvector analysis; pre-SVD normalization; sine-theta theorem; topic-SCORE; word frequency heterogeneity

MSC:

62H12

1. Introduction

In today’s world, an immense volume of text data is generated in scientific research and in our daily lives. This includes research publications, news articles, posts on social media, electronic health records, and many more. Among the various statistical text models, the topic model [1,2] stands out as one of the most widely used. Given a corpus consisting of n documents written on a vocabulary of p words, let

X = [X_{1}, X_{2}, \dots, X_{n}] \in R^{p \times n}

be the word-document-count matrix, where

X_{i} (j)

is the count of the jth word in the ith document, for

1 \leq i \leq n

and

1 \leq j \leq p

. Let

A_{1}, A_{2}, \dots, A_{K} \in R^{p}

be probability mass functions (PMFs). We call each

A_{k}

a topic vector, which represents a particular distribution over words in the vocabulary. For each

1 \leq i \leq n

, let

N_{i}

denote the length of the ith document, and let

w_{i} \in R^{K}

be a weight vector, where

w_{i} (k)

is the fractional weight this document puts on the kth topic, for

1 \leq k \leq K

. In a topic model, the columns of X are independently generated, where the ith column satisfies:

X_{i} \sim Multinomial (N_{i}, d_{i}^{0}), with d_{i}^{0} = \sum_{k = 1}^{K} w_{i} (k) A_{k} .

(1)

Here

d_{i}^{0} \in R^{p}

is the population word frequency vector for the ith document, which admits a convex combination of the K topic vectors. The

N_{i}

words in this document are sampled with replacement from the vocabulary using probabilities in

d_{i}^{0}

; as a result, the word counts follow a multinomial distribution. Under this model,

E [X]

is a rank-K matrix. The statistical problem of interest is using X to estimate the two parameter matrices

A = [A_{1}, A_{2}, \dots, A_{K}]

and

W = [w_{1}, w_{2}, \dots, w_{n}]

.

Since the topic model implies a low-rank structure behind the data matrix, spectral algorithms [3] have been developed for topic model estimation. Topic-SCORE [4] is the first spectral algorithm in the literature. It conducts singular value decomposition (SVD) on a properly normalized version of X, then uses the first K left singular vectors to estimate A, and finally uses

\hat{A}

to estimate W by weighted least-squares. Ref. [4] showed that the error rate on A is

\sqrt{p / (n N)}

up to a logarithmic factor, where N is the order of document lengths. It matches with the minimax lower bound [4] when

N \geq c \cdot p

for a constant

c > 0

, referred to as the long-document case. However, there are many application scenarios with

N = o (p)

, referred to as the short-document case. For example, if we consider a corpus consisting of abstracts of academic publications (e.g., see [3]), N is usually between 100 and 200, but p can be a few thousands or even larger. In this short-document case, ref. [4] observed a gap between the minimax lower bound and the error rate of Topic-SCORE. They posted the following questions: Is the optimal rate still

\sqrt{p / (N n)}

in the short-document case? If so, can spectral algorithms still achieve this rate?

In this paper, we give answers to these questions. We discovered that the gap between the minimax lower bound and the error rate of Topic-SCORE in the short-document case came from the unsatisfactory entry-wise large-deviation bounds for the empirical singular vectors. While the analysis in [4] is effective for long documents, there is considerable room for improvement in the short-document case. We use new analysis to obtain much better large-deviation bounds when

N = o (p)

. Our strategy includes two main components: one is an improved non-stochastic perturbation bound for SVD allowing severe heterogeneity in the population singular vectors, and the other is leveraging a decoupling inequality [5] to control the spectral norm of a random matrix with centered multinomial-distributed columns. These new ideas allow us to obtain satisfactory entry-wise large-deviation bounds for empirical singular vectors across the entire regime of

N \geq {log}^{3} (n)

. As a consequence, we are able to significantly improve the error rate of Topic-SCORE in the short-document case. This answers the two questions posted by [4]: The optimal rate is still

\sqrt{p / (N n)}

in the short-document case, and Topic-SCORE still achieves this optimal rate.

Additionally, inspired by our analysis, we have made a modification to Topic-SCORE to better incorporate document lengths. We also extend the asymptotic setting in [4] to a weak-signal regime allowing the K topic vectors to be extremely similar to each other.

1.1. Related Literature

Many topic modeling algorithms have been proposed in the literature, such as LDA [2], the separable NMF approach [6,7], the method in [8] that uses a low-rank approximation to the original data matrix, Topic-SCORE [4], and LOVE [9]. Theoretical guarantees were derived for these methods, but unfortunately, most of them had non-optimal rates even when

N \geq c \cdot p

. Topic-SCORE and LOVE are the two that achieve the optimal rate when

N \geq c \cdot p

. However, LOVE has no theoretical guarantee when

N = o (p)

; Topic-SCORE has a theoretical guarantee across the entire regime, but the rate obtained by [4] is non-optimal when

N = o (p)

. Therefore, our results address a critical gap in the existing literature by determining the optimal rate for the short-document case for the first time.

Entry-wise eigenvector analysis [10,11,12,13,14,15] provides large-deviation bounds or higher-order expansions for individual entries of the leading eigenvectors of a random matrix. There are two types of random matrices, i.e., the Wigner type (e.g., in network data and pairwise comparison data) and the Wishart type (e.g., in factor models and spiked covariance models [16]). The random matrices in topic models are the Wishart type, and hence, techniques for the Wigner type, such as the leave-one-out approach [15], are not a good fit. We cannot easily extend the techniques [11,14] for spiked covariance models either. One reason is that the multinomial distribution has heavier-than-Gaussian tails (especially for short documents), and using the existing techniques only give non-sharp bounds. Another reason is the severe word frequency heterogeneity [17] in natural languages, which calls for bounds whose orders are different for different entries of an eigenvector. Our analysis overcomes these challenges.

1.2. Organization and Notations

The rest of this paper is organized as follows. Section 2 presents our main results about entry-wise eigenvector analysis for topic models. Section 3 applies these results to obtain improved error bounds for the Topic-SCORE algorithm and determine the optimal rate in the short-document case. Section 4 describes the main technical components, along with a proof sketch. Section 5 concludes the paper with discussions. The proofs of all theorems are relegated to the Appendix A, Appendix B, Appendix C, Appendix D and Appendix E.

Throughout this paper, for a matrix B, let

B (i, j)

or

B_{i j}

represent the

(i, j)

-th entry. We denote

∥ B ∥

as its operator norm and

{∥ B ∥}_{2 \to \infty}

as the 2-to-∞ norm, which is the maximum

ℓ_{2}

norm across all rows of B. For a vector b,

b (i)

or

b_{i}

represents the i-th component. We denote

{∥ b ∥}_{1}

and

∥ b ∥

as the

ℓ_{1}

and

ℓ_{2}

norms of b, respectively. The vector

1_{n}

stands for an all-one vector of dimension n. Unless specified otherwise,

{e_{1}, e_{2}, \dots, e_{p}}

denotes the standard basis of

R^{p}

. Furthermore, we write

a_{n} ≫ b_{n}

or

b_{n} ≪ a_{n}

if

b_{n} / a_{n} = o (1)

for

a_{n}, b_{n} > 0

; and we denote

a_{n} ≍ b_{n}

if

C^{- 1} b_{n} < a_{n} < C b_{n}

for some constant

C > 1

.

2. Entry-Wise Eigenvector Analysis for Topic Models

Let

X \in R^{p \times n}

be the word-count matrix following the topic model in (1). We introduce the empirical frequency matrix

D = [d_{1}, d_{2}, \dots, d_{n}] \in R^{p \times n}

, defined by:

d_{i} (j) = N_{i}^{- 1} X_{i} (j), 1 \leq i \leq n, 1 \leq j \leq p .

(2)

Under the model in (1), we have

E [d_{i}] = d_{i}^{0} = \sum_{k = 1}^{K} w_{i} (k) A_{k}

. Write

D_{0} = [d_{1}^{0}, d_{2}^{0}, \dots, d_{n}^{0}] \in R^{p \times n}

. It follows that:

E D = D_{0} = A W .

We observe that

D_{0}

is a rank-K matrix; furthermore, the linear space spanned by the first K left singular vectors of

D_{0}

is the same as the column space of A. Ref. [4] discovered that there is a low-dimensional simplex structure that explicitly connects the first K left singular vectors of

D_{0}

with the target topic matrix A. This inspired SVD-based methods for estimating A.

However, if one directly conducts SVD on D, the empirical singular vectors can be noisy because of severe word frequency heterogeneity in natural languages [17]. In what follows, we first introduce a normalization on D in Section 2.1 to handle word frequency heterogeneity and then derive entry-wise large-deviation bounds for the empirical singular vectors in Section 2.2.

2.1. A Normalized Data Matrix

We first explain why it is inappropriate to conduct SVD on D. Let

\bar{N} = n^{- 1} \sum_{i = 1}^{n} N_{i}

denote the average document length. Write

D = A W + Z

, with

Z = [z_{1}, z_{2}, \dots, z_{n}] : = D - E D

. The singular vectors of D are the same as the eigenvectors of

D D^{'} = A W W^{'} A^{'} + A W Z^{'} + Z W^{'} A^{'} + Z Z^{'}

. By model (1), the columns of Z are centered multinomial-distributed random vectors; moreover, using the covariance matrix formula for multinomial distributions, we have

E [z_{i} z_{i}^{'}] = N_{i}^{- 1} [diag (d_{i}^{0}) - d_{i}^{0} {(d_{i}^{0})}^{'}]

. It follows that:

\begin{matrix} E [D D^{'}] = A W W^{'} A^{'} + \sum_{i = 1}^{n} N_{i}^{- 1} [diag (d_{i}^{0}) - d_{i}^{0} {(d_{i}^{0})}^{'}] \\ = A W W^{'} A^{'} + diag (\sum_{i = 1}^{n} N_{i}^{- 1} d_{i}^{0}) - A (\sum_{i = 1}^{n} N_{i}^{- 1} w_{i} w_{i}^{'}) A^{'} \\ = n \cdot A \underset{\equiv Σ_{W}}{\underset{⏟}{(\sum_{i = 1}^{n} \frac{N_{i} - 1}{n N_{i}} w_{i} w_{i}^{'})}} A^{'} + \frac{n}{\bar{N}} \cdot \underset{\equiv M_{0}}{\underset{⏟}{diag (\sum_{i = 1}^{n} \frac{\bar{N}}{n N_{i}} d_{i}^{0})}} . \end{matrix}

(3)

Here

A Σ_{W} A^{'}

is a rank-K matrix whose eigen-space is the same as the column span of A. However, because of the diagonal matrix

M_{0}

, the eigen-space of

E [D D^{'}]

is no longer the same as the column span of A. We notice that the jth diagonal of

M_{0}

captures the overall frequency of the jth word across the whole corpus. Hence, this is an issue caused by word frequency heterogeneity. The second term in (3) is larger when

\bar{N}

is smaller. This implies that the issue becomes more severe for short documents.

To resolve this issue, we consider a normalization of D to

M_{0}^{- 1 / 2} D

. It follows that:

E [M_{0}^{- 1 / 2} D D^{'} M_{0}^{- 1 / 2}] = n \cdot M_{0}^{- 1 / 2} A Σ_{W} A^{'} M_{0}^{- 1 / 2} + \frac{n}{\bar{N}} I_{p} .

(4)

Now, the second term is proportional to an identify matrix and no longer affects the eigen-space. Furthermore, the eigen-space of the first term is the column span of

M_{0}^{- 1 / 2} A

, and hence, we can use the eigenvectors to recover

M_{0}^{- 1 / 2} A

(then A is immediately known). In practice,

M_{0}

is not observed, so we replace it by its empirical version:

M = diag (\sum_{i = 1}^{n} \frac{\bar{N}}{n N_{i}} d_{i}) .

(5)

We propose to normalize D to

M^{- 1 / 2} D

before conducting SVD. Later, the singular vectors of

M^{- 1 / 2} D

will be used in Topic-SCORE to estimate A (see Section 3).

This normalization is similar to the pre-SVD normalization in [4] but not exactly the same. Inspired by analyzing a special case where

N_{i} = N

, ref. [4] proposed to normalize D to

{\tilde{M}}^{- 1 / 2} D

, where

\tilde{M} = diag (n^{- 1} \sum_{i = 1}^{n} d_{i})

. They continued using

\tilde{M}

in general settings, but we discover here that the adjustment of

\tilde{M}

to M is necessary when

N_{i}

’s are unequal.

Remark 1.

For extremely low-frequency words, the corresponding diagonal entries of M are very small. This causes an issue when we normalize D to

M^{- 1 / 2} D

. Fortunately, such an issue disappears if we pre-process data. As a standard pre-processing step for topic modeling, we either remove those extremely low-frequency words or combine all of them into a single “meta-word”. We recommend the latter approach. In detail, let

L \subset {1, 2, \dots, p}

be the set of words such that

M (j, j)

is below a proper threshold

t_{n}

(e.g.,

t_{n}

can be

0.05

times the average of diagonal entries of M). We then sum up all rows of D with indices in

L

to a single row. Let

D^{*} \in R^{(p - | L | + 1) \times n}

be the processed data matrix. The matrix

D^{*}

still has a topic model structure, where each new topic vector results from a similar row combination on the corresponding original topic vector.

Remark 2.

The normalization of D to

M^{- 1 / 2} D

is reminiscent of the Laplacian normalization in network data analysis, but the motivation is very different. In many network models, the adjacency matrix satisfies that

B = B_{0} + Y

, where

B_{0}

is a low-rank matrix and Y is a generalized Wigner matrix. Since

E [Y]

is a zero matrix, the eigen-space of

E B

is the same as that of

B_{0}

. Hence, the role of the Laplacian normalization is not correcting the eigen-space but adjusting the signal-to-noise ratio [15]. In contrast, our normalization here aims to turn

E [Z Z^{'}]

into an identity matrix (plus a small matrix that can be absorbed into the low-rank part). We need such a normalization even under moderate word frequency heterogeneity (i.e., the frequencies of all words are at the same order).

2.2. Entry-Wise Singular Analysis for $M^{- 1 / 2} D$

For each

1 \leq k \leq K

, let

{\hat{ξ}}_{k} \in R^{p}

denote the kth left singular vector of

M^{- 1 / 2} D

. Recall that

D_{0} = E D

. In addition, define:

M_{0} : = E M = diag (\sum_{i = 1}^{n} \frac{\bar{N}}{n N_{i}} d_{i}^{0}) .

(6)

Then,

M_{0}^{- 1 / 2} D_{0}

is a population counterpart of

M^{- 1 / 2} D

. However, the singular vectors of

M_{0}^{- 1 / 2} D_{0}

are not the population counterpart of

{\hat{ξ}}_{k}

’s. In light of (4), we define:

ξ_{k} : the k t h eigenvector of M_{0}^{- 1 / 2} E [D D^{'}] M_{0}^{- 1 / 2}, 1 \leq k \leq K .

(7)

Write

\hat{Ξ} : = [{\hat{ξ}}_{1}, \dots, {\hat{ξ}}_{K}]

and

Ξ : = [ξ_{1}, \dots, ξ_{K}]

. We aim to derive a large-deviation bound for each individual row of

(\hat{Ξ} - Ξ)

, subject to a column rotation of

\hat{Ξ}

.

We need a few assumptions. Let

h_{j} = \sum_{k = 1}^{K} A_{k} (j)

for

1 \leq j \leq p

. Define:

H = diag (h_{1}, \dots, h_{p}), Σ_{A} = A^{'} H^{- 1} A, Σ_{W} = \frac{1}{n} \sum_{i = 1}^{n} (1 - N_{i}^{- 1}) w_{i} w_{i}^{'} .

(8)

Here

Σ_{A}

and

Σ_{W}

are called the topic-topic overlapping matrix and the topic-topic concurrence matrix, respectively, [4]. It is easy to see that

Σ_{W}

is properly scaled. We remark that

Σ_{A}

is also properly scaled, because

\sum_{ℓ = 1}^{K} Σ_{A} (k, ℓ) = \sum_{j = 1}^{p} \sum_{ℓ = 1}^{K} h_{j}^{- 1} A_{k} (j) A_{ℓ} (j) = 1

.

Assumption 1.

Let

h_{max} = {max}_{1 \leq j \leq p} h_{j}

,

h_{min} = {min}_{1 \leq j \leq p} h_{j}

and

\bar{h} = \frac{1}{p} \sum_{j = 1}^{p} h_{j}

. We assume:

h_{min} \geq c_{1} \bar{h} = c_{1} K / p, for a constant c_{1} \in (0, 1) .

Assumption 2.

For a constant

c_{2} \in (0, 1)

and a sequence

β_{n} \in (0, 1)

, we assume:

λ_{min} (Σ_{W}) \geq c_{2}, λ_{min} (Σ_{A}) \geq c_{2} β_{n}, min_{1 \leq k, ℓ \leq K} Σ_{A} (k, ℓ) \geq c_{2} .

Assumption 1 is related to word frequency heterogeneity. Each

h_{j}

captures the overall frequency of word j, and

\bar{h} = p^{- 1} \sum_{j} h_{j} = p^{- 1} \sum_{k} {∥ A_{k} ∥}_{1} = K / p

. By Remark 1, all extremely low-frequency words have been combined in pre-processing. It is reasonable to assume that

h_{min}

is at the same order of

\bar{h}

. Meanwhile, we put no restrictions here on

h_{max}

, so that

h_{j}

’s can still be at different orders.

Assumption 2 is about topic weight balance and between-topic similarity.

Σ_{W}

can be regarded as an affinity matrix of

w_{i}

’s. It is mild to assume that

Σ_{W}

is well-conditioned. In a special case where

N_{i} = N

and each

w_{i}

is degenerate,

Σ_{W}

is a diagonal matrix whose kth diagonal entry is the fraction of documents that put all weights on topic k; hence,

λ_{min} (Σ_{W}) \geq c_{2}

is interpreted as “topic weight balance”. Regarding

Σ_{A}

, we have seen that it is properly scaled (its maximum eigenvalue is at the constant order). When K topic vectors are exactly the same,

λ_{min} (Σ_{A}) = 0

; when the topic vectors are not the same,

λ_{min} (Σ_{A}) \neq 0

, and it measures the signal strength. Ref. [4] assumed that

λ_{min} (Σ_{A})

is bounded below by a constant, but we allow weaker signals by allowing

λ_{min} (Σ_{A})

to diminish as

n \to \infty

. We also require a lower bound on

Σ_{A} (k, ℓ)

, meaning that there should be certain overlaps between any two topics. This is reasonable as some commonly used words are not exclusive to any one topic and tend to occur frequently [4].

The last assumption is about the vocabulary size and document lengths.

Assumption 3.

There exists

N \geq 1

and a constant

c_{3} \in (0, 1)

such that

c_{3} N \leq N_{i} \leq c_{3}^{- 1} N

for all

1 \leq i \leq n

. In addition, for an arbitrary constant

C_{0} > 0

:

min {p, N} \geq {log}^{3} (n), max {log (p), log (N)} \leq C_{0} log (n), p {log}^{2} (n) \leq N n β_{n}^{2} .

In Assumption 3, the first two inequalities restrict that N and p are between

{log}^{3} (n)

and

n^{C_{0}}

, for an arbitrary constant

C_{0} > 0

. This covers a wide regime, including the scenarios of both long documents (

N \geq c \cdot p

) and short documents (

N = o (p)

). The third inequality is needed so that the canonical angles between the empirical and population singular spaces converge to zero, which is necessary for our singular vector analysis. This condition is mild, as

N n

is the order of total word count in the corpus, which is often much larger than p.

With these assumptions, we now present our main theorem.

Theorem 1

(Entry-wise singular vector analysis). Fix

K \geq 2

and positive constants

c_{1}, c_{2}, c_{3}

, and

C_{0}

. Under the model (1), suppose Assumptions 1–3 hold. For any constant

C_{1} > 0

, there exists

C_{2} > 0

such that with probability

1 - n^{- C_{1}}

, there is an orthogonal matrix

O \in R^{K \times K}

satisfying that simultaneously for

1 \leq j \leq p

:

∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq C_{2} \sqrt{\frac{h_{j} p log (n)}{n N β_{n}^{2}}} .

The constant

C_{2}

only depends on

C_{1}

and

(K, c_{1}, c_{2}, c_{3}, C_{0})

.

In Theorem 1, we do not assume any gap among the K singular values of

M_{0}^{- 1 / 2} D_{0}

; hence, it is only possible to recover

Ξ

up to a column rotation O. The sin-theta theorem [18] enables us to bound

∥ \hat{Ξ} - Ξ O^{'} ∥_{F}^{2} = \sum_{j = 1}^{p} {∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥}^{2}

, but it is insufficient for analyzing spectral algorithms for topic modeling (see Section 3). We need a bound for each individual row of

(\hat{Ξ} - Ξ O^{'})

, and this bound should depend on

h_{j}

properly.

We compare Theorem 1 with the result in [4]. They assumed that

β_{n}^{- 1} = O (1)

, so their results are only for the strong-signal regime. They showed that when n is sufficiently large:

∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq C (1 + min \{\frac{p}{N}, \frac{p^{2}}{N \sqrt{N}}\}) \sqrt{\frac{h_{j} p log (n)}{n N}} .

(9)

When

N \geq c \cdot p

(long documents), it is the same bound as in Theorem 1 (with

β_{n} = 1

). However, when

N = o (p)

(short documents), it is strictly worse than Theorem 1. We obtain better bounds than those in [4] because of new proof ideas, especially the use of refined perturbation analysis for SVD and a decoupling technique for U-statistics (see Section 4.2).

3. Improved Rates for Topic Modeling

We apply the results in Section 2 to improve the error rates of topic modeling. Topic-SCORE [4] is a spectral algorithm for estimating the topic matrix A. It achieves the optimal rate in the long-document case (

N \geq c \cdot p

). However, in the short-document case (

N = o (p)

), the known rate of Topic-SCORE does not match with the minimax lower bound. We address this gap by providing better error bounds for Topic-SCORE. Our results reveal the optimal rate for topic modeling in the short-document case for the first time.

3.1. The Topic-Score Algorithm

Let

{\hat{ξ}}_{1}, {\hat{ξ}}_{2}, \dots, {\hat{ξ}}_{K}

be as in Section 2. Topic-SCORE first obtains word embeddings from these singular vectors. Note that

M^{- 1 / 2} D

is a non-negative matrix. By Perron’s theorem [19], under mild conditions,

{\hat{ξ}}_{1}

is a strictly positive vector. Define

\hat{R} \in R^{p \times (K - 1)}

by:

\hat{R} (j, k) = {\hat{ξ}}_{k + 1} (j) / {\hat{ξ}}_{1} (j), 1 \leq j \leq p, 1 \leq k \leq K - 1 .

(10)

Let

{\hat{r}}_{1}^{'}, {\hat{r}}_{2}^{'}, \dots, {\hat{r}}_{p}^{'}

denote the rows of

\hat{R}

. Then,

{\hat{r}}_{j}

is a

(K - 1)

-dimensional embedding of the jth word in the vocabulary. This is known as the SCORE embedding [20,21], which is now widely used in analyzing heterogeneous network and text data.

Ref. [4] discovered that there is a simplex structure associated with these word embeddings. Specifically, let

ξ_{1}, ξ_{2}, \dots, ξ_{K}

be the same as in (7) and define the population counterpart of

\hat{R}

as R, where:

R (j, k) = ξ_{k + 1} (j) / ξ_{1} (j), 1 \leq j \leq p, 1 \leq k \leq K - 1 .

(11)

Let

r_{1}^{'}, r_{2}^{'}, \dots, r_{p}^{'}

denote the rows of R. All these

r_{j}

are contained in a simplex

S \subset R^{K - 1}

that has K vertices

v_{1}, v_{2}, \dots, v_{K}

(see Figure 1). If the jth word is an anchor word [6,22] (an anchor word of topic k satisfies that

A_{k} (j) \neq 0

and

A_{ℓ} (j) = 0

for all other

ℓ \neq k

), then

r_{j}

is located at one of the vertices. Therefore, as long as each topic has at least one anchor word, we can apply a vertex hunting [4] algorithm to recover the K vertices of

S

. By definition of a simplex, each point inside

S

can be written uniquely as a convex combination of the K vertices, and the K-dimensional vector consisting of the convex combination coefficients is called the barycentric coordinate. After recovering the vertices of

S

, we can easily compute the barycentric coordinate

π_{j} \in R^{K}

for each

r_{j}

. Write

Π = {[π_{1}, π_{2}, \dots, π_{p}]}^{'}

. Ref. [4] showed that:

A_{k} \propto M_{0}^{1 / 2} diag (ξ_{1}) Π e_{k}, 1 \leq k \leq K .

Therefore, we can recover

A_{k}

by taking the kth column of

M_{0}^{1 / 2} diag (ξ_{1}) Π

and re-normalizing it to have a unit

ℓ^{1}

-norm. This gives the main idea behind Topic-SCORE (see Figure 1).

The full algorithm is given in Algorithm 1. It requires plugging in a vertex hunting (VH) algorithm. A VH algorithm aims to estimate

v_{1}, v_{2}, \dots, v_{K}

from the noisy point cloud

{{\hat{r}}_{j}}_{1 \leq j \leq p}

. There are many existing VH algorithms (see sec 3.4 of [21]). A VH algorithm is said to be efficient if it satisfies that

{max}_{1 \leq k \leq K} ∥ {\hat{v}}_{k} - v_{k} ∥ \leq C {max}_{1 \leq j \leq p} ∥ {\hat{r}}_{j} - r_{j} ∥

(subject to a permutation of

{\hat{v}}_{1}, {\hat{v}}_{2}, \dots, {\hat{v}}_{K}

). We always plug in an efficient VH algorithm, such as the successive projection algorithm [23], the pp-SPA algorithm [24], and several algorithms in sec 3.4 of [21].

Algorithm 1 Topic-SCORE

Input: D, K, and a vertex hunting (VH) algorithm.

(Word embedding) Let M be as in (5). Obtain ${\hat{ξ}}_{1}, {\hat{ξ}}_{2}, \dots, {\hat{ξ}}_{K}$ , the first K left singular vectors of $M^{- 1 / 2} D$ . Compute $\hat{R}$ as in (10) and write $\hat{R} = {[{\hat{r}}_{1}, {\hat{r}}_{2}, \dots, {\hat{r}}_{p}]}^{'}$ .
(Vertex hunting). Apply the VH algorithm on ${{\hat{r}}_{j}}_{1 \leq j \leq p}$ to get ${\hat{v}}_{1}, \dots, {\hat{v}}_{K}$ .
(Topic matrix estimation) For $1 \leq j \leq p$ , solve ${\hat{π}}_{j}^{*}$ from:

$(\begin{matrix} 1 & \dots & 1 \\ {\hat{v}}_{1} & \dots & {\hat{v}}_{K} \end{matrix}) {\hat{π}}_{j}^{*} = (\begin{matrix} 1 \\ {\hat{r}}_{j} \end{matrix}) .$

Let ${\tilde{π}}_{j}^{*} = max {{\hat{π}}_{j}^{*}, 0}$ (the maximum is taken component-wise) and ${\hat{π}}_{i} = {\tilde{π}}_{j}^{*} / {∥ {\tilde{π}}_{j}^{*} ∥}_{1}$ . Write $\hat{Π} = {[{\hat{π}}_{1}, \dots, {\hat{π}}_{p}]}^{'}$ . Let $\tilde{A} = M^{1 / 2} diag ({\hat{ξ}}_{1}) \hat{Π}$ . Obtain $\hat{A} = \tilde{A} {[diag (1_{p}^{'} \tilde{A})]}^{- 1}$ .

Output: the estimated topic matrix

\hat{A}

.

Additionally, after

\hat{A}

is obtained, ref. [4] suggested to estimate

w_{1}, w_{2}, \dots, w_{n}

as follows. We first run a weighted least-squares to obtain

{\hat{w}}_{i}^{*}

:

{\hat{w}}_{i}^{*} = {argmin}_{w \in R^{K}} {∥ M^{- 1 / 2} (d_{i} - A w_{i}) ∥}^{2}, 1 \leq i \leq n .

(12)

Then, set all the negative entries of

{\hat{w}}_{i}^{*}

to zero and re-normalize the vector to have a unit

ℓ^{1}

-norm. The resulting vector is

{\hat{w}}_{i}

.

Remark 3.

In real-world applications, both n and p can be very large. However, since

\hat{R}

is constructed from only a few singular vectors, its rows are only in dimension

(K - 1)

. It suggests that Topic-SCORE leverages a ‘low-dimensional’ simplex structure and is scalable to large datasets. When K is bounded, the complexity of Topic-SCORE is at most

O (n p min {n, p})

[4]. The real computing time was also reported in [4] for various values of (

n, p

). For example, when both n and p are a few thousands, it takes only a few seconds to run Topic-SCORE.

3.2. The Improved Rates for Estimating A and W

We provide the error rates of Topic-SCORE. First, we study the word embeddings

{\hat{r}}_{j}

. By (10),

{\hat{r}}_{j}

is constructed from the jth row of

\hat{Ξ}

. Therefore, we can apply Theorem 1 to derive a large-deviation bound for

{\hat{r}}_{j}

.

Without loss of generality, we set

C_{1} = 4

henceforth, transforming the event probability

1 - n^{- C_{1}}

in Theorem 1 to

1 - o (n^{- 3})

. We also use C to denote a generic constant, whose meaning may change from one occurrence to another. In all instances, C depends sorely on K and the constants

(c_{1}, c_{2}, c_{3}, C_{0})

in Assumptions 1–3.

Theorem 2

(Word embeddings). Under the setting of Theorem 1, with probability

1 - o (n^{- 3})

, there exists an orthogonal matrix

Ω \in R^{(K - 1) \times (K - 1)}

such that simultaneously for

1 \leq j \leq p

:

∥ {\hat{r}}_{j} - Ω r_{j} ∥ \leq C \sqrt{\frac{p log (n)}{n N β_{n}^{2}}} .

Next, we study the error of

\hat{A}

. The

ℓ^{1}

-estimation error is

L (\hat{A}, A) : = \sum_{k = 1}^{K} {∥ {\hat{A}}_{k} - A_{k} ∥}_{1}

, subject to an arbitrary column permutation of

\hat{A}

. For ease of notation, we do not explicitly denote this permutation in theorem statements, but it is accounted for in the proofs. For each

1 \leq j \leq p

, let

{\hat{a}}_{j}^{'} \in R^{K}

and

a_{j}^{'} \in R^{K}

denote the jth row of

\hat{A}

and A, respectively. We can re-write the

ℓ^{1}

-estimation error as

L (\hat{A}, A) = \sum_{j = 1}^{p} {∥ {\hat{a}}_{j} - a_{j} ∥}_{1}

. The next theorem provides an error bound for each individual

{\hat{a}}_{j}

, and the aggregation of these bounds yields an overall bound for

L (\hat{A}, A)

:

Theorem 3

(Estimation of A). Under the setting of Theorem 1, we additionally assume that each topic has at least one anchor word. With probability

1 - o (n^{- 3})

, simultaneously for

1 \leq j \leq p

:

∥ {\hat{a}}_{j} - a_{j} ∥_{1} \leq {∥ a_{j} ∥}_{1} \cdot C \sqrt{\frac{p log (n)}{n N β_{n}^{2}}} .

Furthermore, with probability

1 - o (n^{- 3})

, the

ℓ^{1}

-estimation error satisfies that:

L (\hat{A}, A) \leq C \sqrt{\frac{p log (n)}{n N β_{n}^{2}}} .

Theorem 3 improves the result in [4] in two aspects. First, [4] assumed

β_{n}^{- 1} = O (1)

, so their results did not allow for weak signals. Second, even when

β_{n}^{- 1} = O (1)

, their bound is worse than ours by a factor similar to that in (9).

Finally, we have the error bound for estimating

w_{i}

’s using the estimator in (12).

Theorem 4

(Estimation of W). Under the setting of Theorem 3, with probability

1 - o (n^{- 3})

, subject to a column permutation of

\hat{W}

:

max_{1 \leq i \leq n} {∥ {\hat{w}}_{i} - w_{i} ∥}_{1} \leq C β_{n}^{- 1} (\sqrt{\frac{p log (n)}{n N β_{n}^{2}}} + C \sqrt{\frac{log (n)}{N}}) .

In Theorem 4, there are two terms in the error bound of

{\hat{w}}_{i}

. The first term comes from the estimation error in

\hat{A}

, and the second term is from noise in

d_{i}

. In the strong-signal case of

β_{n}^{- 1} = O (1)

, we can compare Theorem 4 with the bound for

{\hat{w}}_{i}

in [4]. The bound there also has two terms: its second term is similar to ours, but its first term is strictly worse.

3.3. Connections and Comparisons

There have been numerous results about the error rates of estimating A and W. For example, ref. [6] provided the first explicit theoretical guarantees for topic modeling, but they did not study the statistical optimality of their method. Recently, the statistical literature aimed to understand the fundamental limits of topic modeling. Assuming

β_{n}^{- 1} = O (1)

, refs. [4,9] gave a minimax lower bound,

\sqrt{p / (N n)}

, for the rate of estimating A, and refs. [25,26] gave a minimax lower bound,

1 / \sqrt{N}

, for estimating each

w_{i}

.

For estimating A, when

β_{n}^{- 1} = O (1)

, the existing theoretical results are summarized in Table 1. Ours is the only one that matches the minimax lower bound across the entire regime. In the long-document case (

N \geq c \cdot p

, Cases 1–2 in Table 1), the error rates in [4,9] together have matched the lower bound, concluding that

\sqrt{p / (N n)}

is indeed the optimal rate. However, in the short-document case (

N = o (p)

, Case 3 in Table 1), there was a gap between the lower bound and the existing error rates. Our result addresses the gap and concludes that

\sqrt{p / (N n)}

is still the optimal rate. When

β_{n} = o (1)

, the error rates of estimating A were rarely studied. We conjecture that

\sqrt{p / (N n β_{n}^{2})}

is the optimal rate, and the Topic-SCORE algorithm is still rate-optimal.

We emphasize that our rate is not affected by severe word frequency heterogeneity. As long as

h_{min} / \bar{h}

is lower bounded by a constant (see Assumption 1 and explanations therein), our rate is always the same, regardless of the magnitude of

h_{max}

. In contrast, the error rate in [9] is sensitive to word frequency heterogeneity, with an extra factor of

h_{max} / h_{min}

that can be as large as p. There are two reasons that enable us to achieve a flat rate even under severe word frequency heterogeneity: one is the proper normalization of data matrix, as described in Section 2.1, and the other is the careful analysis of empirical singular vectors (see Section 4).

For estimating W, when

β_{n}^{- 1} = O (1)

, our error rate in Theorem 4 matches the minimax lower bound if

n \geq p log (n)

. Our approach to estimating W involves first obtaining

\hat{A}

and then regressing

d_{i}

on

\hat{A}

to derive

{\hat{w}}_{i}

. The condition

n \geq p log (n)

ensures that the estimation error in

\hat{A}

does not dominate the overall error. This condition is often met in scenarios where a large number of documents can be collected, but the vocabulary size remains relatively stable. However, if

n < p log (n)

, a different approach is necessary, requiring the estimation of W first. This involves using the right singular vectors of

M^{- 1 / 2} D

. While our analysis has primarily focused on the left singular vectors, it can be extended to study the right singular vectors as well.

4. Proof Ideas

Our main result is Theorem 1, which provides entry-wise large-deviation bounds for singular vectors of

M^{- 1 / 2} D

. Given this theorem, the proofs of Theorems 2–4 are similar to those in [4] and thus relegated to the appendix. In this section, we focus on discussing the proof techniques of Theorem 1.

4.1. Why the Leave-One-Out Technique Fails

Leave-one-out [13,15] is a common technique in entry-wise eigenvector analysis for a Wigner-type random matrix

B = B_{0} + Y \in R^{m \times m}

, where

B_{0}

is a symmetric non-stochastic low-rank matrix and Y is a symmetric random matrix whose upper triangle consists of independent mean-zero variables. One example of such matrices is the adjacency matrix of a random graph generated from the block-model family [20].

However, our target here is the singular vectors of

M^{- 1 / 2} D

, which are the eigenvectors of

B : = M^{- 1 / 2} D D^{'} M^{- 1 / 2}

. This is a Wishart-type random matrix, whose upper triangular entries are not independent. We may also construct a symmetric matrix:

G : = (\begin{matrix} 0 & M^{- 1 / 2} D \\ D^{'} M^{- 1 / 2} & 0 \end{matrix}) \in R^{(p + n) \times (p + n)} .

The eigenvectors of

G

take the form

{\hat{u}}_{k} = {({\hat{ξ}}_{k}^{'}, {\hat{η}}_{k}^{'})}^{'}

,

1 \leq k \leq K

, where

{\hat{ξ}}_{k} \in R^{p}

and

{\hat{η}}_{k} \in R^{n}

are the kth left and right singular vectors of

M^{- 1 / 2} D

, respectively. Unfortunately, the upper triangle of

G

still contains dependent entries. Some dependence is from the normalization matrix M. It may be addressed by using the techniques developed by [15] in studying graph Laplacian matrices. A more severe issue is the dependence among entries in D. According to basic properties of multinomial distributions, D only has column independence but no row independence. As a result, even after we replace M by

M_{0}

, the jth row and column of

G

are still dependent of the remaining ones, for each

1 \leq j \leq p

. In conclusion, we cannot apply the leave-one-out technique on either B or

G

.

4.2. The Proof Structure in [4] and Why It Is Not Sharp for Short Documents

Our entry-wise eigenvector analysis primarily follows the proof structure in [4]. Recall that

{\hat{ξ}}_{k} \in R^{p}

is the kth left singular vector of

M^{- 1 / 2} D

. Define:

G : = M^{- 1 / 2} D D^{'} M^{- 1 / 2} - \frac{n}{\bar{N}} I_{p}, G_{0} : = n \cdot M_{0}^{- 1 / 2} A Σ_{W} A^{'} M_{0}^{- 1 / 2} .

(13)

Since the identify matrix in G does not affect the eigenvectors,

{\hat{ξ}}_{k}

is the kth eigenvector of G. Additionally, it follows from (7) and (4) that

ξ_{k}

is the kth eigenvector of

G_{0}

. By (4):

G - G_{0} = M^{- 1 / 2} D D^{'} M^{- 1 / 2} - M_{0}^{- 1 / 2} E [D D^{'}] M_{0}^{- 1 / 2} .

(14)

The entry-wise eigenvector analysis in [4] has two steps. Step 1: Non-stochastic perturbation analysis. In this step, no distributional assumptions are made on G. The analysis solely focuses on connecting the perturbation from

Ξ

to

\hat{Ξ}

with the perturbation from

G_{0}

to G. They showed in Lemma F.1 [4]:

∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq C ∥ G_{0} ∥^{- 1} (∥ e_{j}^{'} Ξ ∥ ∥ G - G_{0} ∥ + \sqrt{K} ∥ e_{j}^{'} (G - G_{0}) ∥) .

(15)

Step 2: Large-deviation analysis of

G - G_{0}

. In this step, ref. [4] derived the large-deviation bounds for

∥ G - G_{0} ∥

and

∥ e_{j}^{'} (G - G_{0}) ∥

under the multinomial model (1). For example, they showed in Lemma F.5 [4] that when n is properly large, with high probability:

∥ G - G_{0} ∥ \leq C (1 + N^{- 1} \sqrt{p}) \sqrt{\frac{n p log (n)}{N}} .

(16)

However, when

N = o (p)

(short documents), neither step is sharp. In (15), the second term

∥ e_{j}^{'} (G - G_{0}) ∥

was introduced as an upper bound for

∥ e_{j}^{'} (G - G_{0}) \hat{Ξ} ∥

, but this bound is too crude. In Section 4.3, we will conduct careful analysis of

∥ e_{j}^{'} (G - G_{0}) \hat{Ξ} ∥

and introduce a new perturbation bound which significantly improves (15). In (16), the spectral norm is controlled via an

ε

-net argument [27], which reduces the analysis to studying a quadratic form of Z; ref. [4] analyzed this quadratic form by applying martingale Bernstein inequality. Unfortunately, in the short-document case, it is hard to control the conditional variance process of the underlying martingale. In Section 4.4, we address it by leveraging the matrix Bernstein inequality [28] and the decoupling inequality [5,29] for U-statistics.

4.3. Non-Stochastic Perturbation Analysis

In this subsection, we abuse notations to use G and

G_{0}

to denote two arbitrary

p \times p

symmetric matrices, with

rank (G_{0}) = K

. For

1 \leq k \leq K

, let

{\hat{λ}}_{k}

and

λ_{k}

be the kth largest eigenvalue (in magnitude) of G and

G_{0}

, respectively, and let

{\hat{ξ}}_{k} \in R^{p}

and

ξ_{k} \in R^{p}

be the associated eigenvectors. Write

\hat{Λ} = diag ({\hat{λ}}_{1}, {\hat{λ}}_{2}, \dots, {\hat{λ}}_{K})

,

\hat{Ξ} = [{\hat{ξ}}_{1}, {\hat{ξ}}_{2}, \dots, {\hat{ξ}}_{K}]

, and define

Λ

and

Ξ

similarly. Let

U \in R^{K \times K}

and

V \in R^{K \times K}

be such that its columns contain the left and right singular vectors of

{\hat{Ξ}}^{'} Ξ

, respectively. Define

sgn ({\hat{Ξ}}^{'} Ξ) = U^{'} V

. For any matrix B and

q > 0

, let

{∥ B ∥}_{q \to \infty} = {max}_{i} {∥ e_{i}^{'} B ∥}_{q}

.

Lemma 1.

Suppose

∥ G - G_{0} ∥ \leq (1 - c_{0}) | {\hat{λ}}_{K} |

, for some

c_{0} \in (0, 1)

. Consider an arbitrary

p \times p

diagonal matrix

Γ = diag (γ_{1}, γ_{2}, \dots, γ_{p})

, where:

γ_{j} > 0 i s a n u p p e r b o u n d f o r ∥ e_{j}^{'} Ξ ∥ ∥ G - G_{0} ∥ + ∥ e_{j}^{'} (G - G_{0}) Ξ ∥ .

If

∥ Γ^{- 1} (G - G_{0}) {Γ ∥}_{1 \to \infty} \leq (1 - c_{0}) | {\hat{λ}}_{K} |

, then for the orthogonal matrix

O = sgn ({\hat{Ξ}}^{'} Ξ)

, it holds simultaneously for

1 \leq j \leq p

that:

∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq c_{0}^{- 1} {| {\hat{λ}}_{K} |}^{- 1} γ_{j} .

Since

γ_{j}

is an upper bound for

∥ e_{j}^{'} Ξ ∥ ∥ G - G_{0} ∥ + ∥ e_{j}^{'} (G - G_{0}) Ξ ∥

, we can interpret the result in Lemma 1 as:

∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq C | {\hat{λ}}_{K} |^{- 1} (∥ e_{j}^{'} Ξ ∥ ∥ G - G_{0} ∥ + ∥ e_{j}^{'} (G - G_{0}) Ξ ∥) .

(17)

Comparing (17) with (15), the second term has been reduced. Since

Ξ

projects the vector

e_{j}^{'} (G - G_{0})

into a much lower dimension, we expect that

∥ e_{j}^{'} (G - G_{0}) Ξ ∥ ≪ ∥ e_{j}^{'} (G - G_{0}) ∥

in many random models for G. In particular, this is true for the G and

G_{0}

defined in (13). Hence, there is a significant improvement over the analysis in [4].

4.4. Large-Deviation Analysis of $(G - G_{0})$

In this subsection, we focus on the specific G and

G_{0}

as defined in (13). The crux of proving Theorem 1 lies in determining the upper bound

γ_{j}

as defined in Lemma 1. This is accomplished through the following lemma.

Lemma 2.

Under the settings of Theorem 1, let G and

G_{0}

be as in (13). For any constant

C_{1} > 0

, there exists

C_{3} > 0

such that with probability

1 - n^{- C_{1}}

, simultaneously for

1 \leq j \leq p

:

∥ G - G_{0} ∥ \leq C_{3} \sqrt{\frac{p n log (n)}{N}}, ∥ e_{j}^{'} (G - G_{0}) Ξ ∥ \leq C_{3} \sqrt{\frac{h_{j} n p log (n)}{N}} .

The constant

C_{3}

only depends on

C_{1}

and

(K, c_{1}, c_{2}, c_{3}, C_{0})

.

We compare the bound for

∥ G - G_{0} ∥

in Lemma 2 with the one in [4] as summarized in (16). There is a significant improvement when

N \leq p^{2}

. This improvement primarily stems from the application of a decoupling inequality for U-statistics, as elaborated below.

We outline the proof of the bound for

∥ G - G_{0} ∥

. Let

Z = D - E [D] = [z_{1}, z_{2}, \dots, z_{n}]

. From (A24) and (A25) in Appendix A,

G - G_{0}

decomposes into the sum of four matrices, where it is most subtle to bound the spectral norm of the fourth matrix:

E_{4} : = M_{0}^{- 1 / 2} (Z Z^{'} - E [Z Z^{'}]) M_{0}^{- 1 / 2} .

Define

X_{i} = (M_{0}^{- 1 / 2} z_{i}) {(M_{0}^{- 1 / 2} z_{i})}^{'} - E [{(M_{0}^{- 1 / 2} z_{i}) (M_{0}^{- 1 / 2} z_{i})}^{'}]

. It is seen that

E_{4} = \sum_{i = 1}^{n} X_{i}

, which is a sum of n independent matrices. We apply the matrix Bernstein inequality [28] (Theorem A1) to obtain that if there exist

b > 0

and

σ^{2} > 0

such that

∥ X_{i} ∥ \leq b

almost surely for all i and

∥ \sum_{i = 1}^{n} E X_{i}^{2} ∥ \leq σ^{2}

, then for every

t > 0

,

P (∥ \sum_{i = 1}^{n} X_{i} ∥ \geq t) \leq 2 p exp (- \frac{t^{2} / 2}{σ^{2} + b t / 3}) .

Determination of b and

σ^{2}

requires upper bounds for

∥ X_{i} ∥

and

∥ E X_{i}^{2} ∥

. Since each

X_{i}

is equal to a rank-1 matrix minus its expectation, it reduces to deriving large-deviation bounds for

∥ M_{0}^{- 1 / 2} z_{i} ∥^{2}

. Note that each

z_{i}

can be equivalently represented by

z_{i} = N_{i}^{- 1} \sum_{m = 1}^{N} (T_{i m} - E T_{i m})

, where

{T_{i m}}_{m = 1}^{N_{i}}

are i.i.d. Multinomial

(1, d_{i}^{0})

. It yields that

∥ M_{0}^{- 1 / 2} z_{i} ∥^{2} = I_{1} + I_{2}

, where

I_{2}

is a term that can be controlled using standard large-deviation inequalities, and:

I_{1} : = N_{i}^{- 2} \sum_{1 \leq m_{1} \neq m_{2} \leq N_{i}} (T_{i m_{1}} - E T_{i m_{1}}) M_{0}^{- 1} (T_{i m_{2}} - E T_{i m_{2}}) .

The remaining question is how to bound

| I_{1} |

. We notice that

I_{1}

is a U-statistic with degree 2. The decoupling inequality [5,29] is a useful tool for studying U-statistics.

Theorem 5

(A special decoupling inequality [29]). Let

{X_{m}}_{m = 1}^{N}

be a sequence of i.i.d. random vectors in

R^{d}

, and let

{{\tilde{X}}_{m}}_{m = 1}^{N}

be an independent copy of

{X_{m}}_{m = 1}^{N}

. Suppose that

h : R^{2 d} \to R

is a measurable function. Then, there exists a constant

C_{4} > 0

independent of

n, m, d

such that for all

t > 0

:

\begin{matrix} P (| \sum_{m \neq m_{1}} h (X_{m}, X_{m_{1}}) | \geq t) \leq C_{4} P (C_{4} | \sum_{m \neq m_{1}} h (X_{m}, {\tilde{X}}_{m_{1}}) | \geq t) . \end{matrix}

Let

{{\tilde{T}}_{i m}}_{m = 1}^{N_{i}}

be an independent copy of

{T_{i m}}_{m = 1}^{N_{i}}

. By Theorem 5, the large-deviation bound of

I_{1}

can be inferred from the large-deviation bound of:

{\tilde{I}}_{1} : = N_{i}^{- 2} \sum_{1 \leq m_{1} \neq m_{2} \leq N_{i}} {(T_{i m_{1}} - E T_{i m_{1}})}^{'} M_{0}^{- 1} ({\tilde{T}}_{i m_{2}} - E {\tilde{T}}_{i m_{2}}) .

Using

h (T_{i m_{1}}, {\tilde{T}}_{i m_{2}})

to denote the summand in the above sum, we have a decomposition:

{\tilde{I}}_{1} = N_{i}^{- 2} \sum_{m_{1}, m_{2}} h (T_{i m_{1}}, {\tilde{T}}_{i m_{2}}) - N_{i}^{- 2} \sum_{m} h (T_{i m}, {\tilde{T}}_{i m})

. The second term is a sum of independent variables and can be controlled by standard large-deviation inequalities. Hence, the analysis of

{\tilde{I}}_{1}

reduces to the analysis of

{\tilde{I}}_{1}^{*} : = N_{i}^{- 2} \sum_{m_{1}, m_{2}} h (T_{i m_{1}}, {\tilde{T}}_{i m_{2}})

. We re-write

{\tilde{I}}_{1}^{*}

as:

{\tilde{I}}_{1}^{*} = N_{i}^{- 2} y^{'} \tilde{y}, with y : = \sum_{m = 1}^{N_{i}} M_{0}^{- 1 / 2} (T_{i m} - E T_{i m}), \tilde{y} : = \sum_{m = 1}^{N_{i}} M_{0}^{- 1 / 2} ({\tilde{T}}_{i m} - E {\tilde{T}}_{i m}) .

Since

\tilde{y}

is independent of y, we apply large-deviation inequalities twice. First, conditional on

\tilde{y}

,

{\tilde{I}}_{1}^{*}

is a sum of

N_{i}

independent variables (randomness comes from

T_{i m}

’s). We apply the Bernstein inequality to get a large-deviation bound for

{\tilde{I}}_{1}^{*}

, which depends on a quantity

σ^{2} (\tilde{y})

. Next, since

σ^{2} (\tilde{y})

can also be written as a sum of

N_{i}

independent variables (randomness comes from

{\tilde{T}}_{i m}

’s), we apply the Bernstein inequality again to obtain a large-deviation bound for

σ^{2} (\tilde{y})

. Combining two steps gives the large-deviation bound for

{\tilde{I}}_{1}^{*}

.

Remark 4.

The decoupling inequality is employed multiple times to study other U-statistics-type quantities arising in our proof. For example, recall that

(G - G_{0})

decomposes into the sum of four matrices, and we have only discussed how to bound

∥ E_{4} ∥

. In the analysis of

∥ E_{2} ∥

and

∥ E_{3} ∥

, we need to bound other quadratic terms involving a sum over

(i, m)

, with

1 \leq i \leq n

and

1 \leq m \leq N_{i}

. In that case, we need a more general decoupling inequality. We refer readers to Theorem A3 in Appendix A for more details.

Remark 5.

The analysis in [4] uses an ϵ-net argument [27] and the martingale Bernstein inequality [30] to study

∥ E_{4} ∥

. In our analysis, we use the matrix Bernstein inequality [28], instead of the ϵ-net argument. The matrix Bernstein inequality enables us to tackle each quadratic term related to each i separately instead of handling complicated quadratic terms involving summation over i and m simultaneously. Additionally, we adopt the decoupling inequality for U-statistics [5,29], instead of the martingale Bernstein inequality, to study all the quadratic terms arising in our analysis. The decoupling inequality converts the tail anaysis of quadratic terms into tail analysis of (conditionally) independent sums. It provides sharper bounds when the variables have heavy tails (which is the case for the word counts in a topic model, especially when documents are short).

4.5. Proof Sketch of Theorem 1

We combine the non-stochastic perturbation result in Lemma 1 and the large-deviation bounds in Lemma 2 to prove Theorem 1. By Lemma A2,

| λ_{K} | \geq C^{- 1} n β_{n}

. It follows from Weyl’s inequality, the first claim in Lemma 2, and the assumption of

p {log}^{2} (n) \leq N n β_{n}^{2}

that with probability

1 - n^{- C_{1}}

:

| {\hat{λ}}_{K} | \geq | λ_{k} | \cdot [1 - O ({[log (n)]}^{- 1 / 2})] \geq C^{- 1} n β_{n} .

In addition, it can be shown (see Lemma A2) that

∥ e_{j}^{'} Ξ ∥ \leq C h_{j}^{1 / 2}

. Combining this with the two claims in Lemma 2 gives that with probability

1 - n^{- C_{1}}

:

∥ e_{j}^{'} Ξ ∥ ∥ G - G_{0} ∥ + ∥ e_{j}^{'} (G - G_{0}) Ξ ∥ \leq C \sqrt{\frac{h_{j} n p log (n)}{N}} : = γ_{j} .

We hope to apply Lemma 1. This requires obtaining a bound for

∥ Γ^{- 1} (G - G_{0}) {Γ ∥}_{1 \to \infty}

. Since

Γ \propto H^{1 / 2}

, it suffices to study

∥ H^{- 1 / 2} (G - G_{0}) H^{1 / 2} ∥_{1 \to \infty}

. Similar to the analysis of

∥ e_{j}^{'} (G - G_{0}) Ξ ∥

, we can show (see the proofs of Lemmas A5 and A6, such as (A58)) that

∥ e_{j}^{'} (G - G_{0}) H^{1 / 2} ∥_{1} \leq C N^{- 1 / 2} {[h_{j} n p log (n)]}^{1 / 2} \leq C \sqrt{h_{j} / log (n)} \cdot n β_{n}

, where the last inequality is because of

p {log}^{2} (n) \leq N n

. We immediately have:

∥ H^{- 1 / 2} (G - G_{0}) H^{1 / 2} ∥_{1 \to \infty} = max_{j} \{h_{j}^{- 1 / 2} {∥ e_{j}^{'} (G - G_{0}) H^{- 1 / 2} ∥}_{1}\} \leq \frac{C n β_{n}}{\sqrt{log (n)}} \leq \frac{| {\hat{λ}}_{K} |}{2} .

We then apply Lemma 1 to get

∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq C | {\hat{λ}}_{K} |^{- 1} γ_{j} \leq C {(n β_{n})}^{- 1} γ_{j}

. The claim of Theorem 1 follows immediately by plugging in the value of

γ_{j}

as given above.

5. Summary and Discussion

The topic model imposes a “low-rank plus noise” structure on the data matrix. However, the noise is not simply additive; rather, it consists of centered multinomial random vectors. The eigenvector analysis in a topic model is more complex than standard eigenvector analysis for random matrices. Firstly, the entries of the data matrix are weakly dependent, making techniques such as leave-one-out inapplicable. Secondly, due to the significant word frequency heterogeneity in natural languages, entry-wise eigenvector analysis becomes much more nuanced, as different entries of the same eigenvector have significantly different bounds. Additionally, the data exhibit Bernstein-type tails, precluding the use of random matrix theory tools that assume sub-exponential entries. While we build on the analysis in [4], we address these challenges with new proof ideas. Our results provide the most precise eigenvector analysis and rate-optimality results for topic modeling, to the best of our knowledge.

A related but more ambitious goal is obtaining higher-order expansions of the empirical singular vectors. Since the random matrix under study in the topic model is the Wishart type, we can possibly borrow techniques in [31] to study the joint distribution of empirical singular values and singular vectors. In this paper, we assume the number of topics, K, is finite, but our analysis can be easily extended to the scenario of a growing K (e.g.,

K = O (log (n))

). We assume

min {p, N} \geq {log}^{3} (n)

. When

p < {log}^{3} (n)

, it becomes a low-dimensional eigenvector analysis problem, which is easy to tackle. When

N < {log}^{3} (n)

, it is the extremely short documents case (i.e., each document has only a finite length, say, fewer than 20, as in documents such as Tweets). We leave it to future work.

Author Contributions

Z.T.K. and J.W. developed the method and theory and wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation CAREER grant DMS-1943902.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Preliminary Lemmas and Theorems

In this section, we collect the preliminaries lemmas and theorems that will be used in the entry-wise eigenvector analysis. Under Assumption 3,

N_{i} ≍ \bar{N} ≍ N

. Therefore, throughout this section and subsequent sections, we always assume

\bar{N} = N

without loss of generality.

The first lemma describes the estimates of the entries in

M_{0}

and reveals its relation to the underlying frequency parameters, and further provides the large-deviation bound for the normalization matrix M.

Lemma A1

(Lemmas D.1 & E.1 in [4]). Recall the definitions

M = diag (n^{- 1} \sum_{i = 1}^{n} N d_{i} / N_{i})

,

M_{0} = diag (n^{- 1} \sum_{i = 1}^{n} N d_{i}^{0} / N_{i})

, and

h_{j} = \sum_{k = 1}^{K} A_{k} (j)

for

1 \leq j \leq p

. Suppose the conditions in Theorem 1 hold. Then:

\begin{matrix} M_{0} (j, j) ≍ h_{j}; a n d | M (j, j) - M_{0} (j, j) | \leq C \sqrt{\frac{h_{j} log (n)}{N n}}, \end{matrix}

for some constant

C > 0

, with probability

1 - o (n^{- 3})

, simultaneously for all

1 \leq j \leq p

. Furthermore, with probability

1 - o (n^{- 3})

,

\begin{matrix} ∥ M^{- 1 / 2} M_{0}^{1 / 2} - I_{p} ∥ \leq C \sqrt{\frac{p log (n)}{N n}} . \end{matrix}

(A1)

Remark A1.

In this lemma and other subsequent lemmas, “with probability

1 - o (n^{- 3})

” can always be replaced by “with probability

1 - n^{- C_{1}}

”, for an arbitrary constant

C_{1} > 0

. The small-probability events in these lemmas come from the Bernstein inequality or the matrix Bernstein inequality. These inequalities concern small-probability events associated with an arbitrary probability

δ \in (0, 1)

, and the high-probability bounds depend on

log (1 / δ)

. When

δ = n^{- C_{1}}

,

log (1 / δ) = C_{1} log (n)

. Therefore, changing

C_{1}

only changes the high-probability bound by a constant. Without loss of generality, we take

C_{1} = 4

for convenience.

The proof of the first statement is quite similar to the proof detailed in the supplementary materials of [4]. The only difference is the existence of the additional factor

N / N_{i}

. Thanks to the condition that

N_{i}

’s are at the same order, it is not hard to see that

M_{0} (j, j) ≍ n^{- 1} \sum_{i = 1}^{n} d_{i}^{0} (j)

,where the RHS is exactly the definition of

M_{0}

in [4]. Thus, the proof follows simply under Assumption 2. To obtain the large-deviation bound, the following representation is crucial:

M (j, j) - M_{0} (j, j) = \frac{1}{n} \sum_{i = 1}^{n} \frac{N}{N_{i}} (d_{i} (j) - d_{i}^{0} (j)) = \frac{1}{n} \sum_{i = 1}^{n} \frac{N}{N_{i}^{2}} \sum_{m = 1}^{N_{i}} T_{i m} (j) - d_{i}^{0} (j),

where

{T_{i m}}_{m = 1}^{n}

are i.i.d. Multinomial

(1, d_{i}^{0})

with

d_{i}^{0} = A w_{i}

. The RHS is a sum of independent random variables, thus allowing the application of Bernstein inequality. The inequality (A1) is not provided in the supplementary materials of [4], but it follows easily from the first statement. We prove (A1) in detail below.

By definition, it suffices to claim that:

| \frac{\sqrt{M_{0} (j, j)}}{\sqrt{M (j, j)}} - 1 | \leq C \sqrt{\frac{p log (n)}{N n}}

simultaneously for all

1 \leq j \leq p

. To this end, we derive:

| \frac{\sqrt{M_{0} (j, j)}}{\sqrt{M (j, j)}} - 1 | \leq \frac{| M_{0} (j, j) - M (j, j) |}{\sqrt{M (j, j)} (\sqrt{M_{0} (j, j)} + \sqrt{M (j, j)})}

Using the large-deviation bound

| M (j, j) - M_{0} (j, j) | \leq C \sqrt{h_{j} log (n) / (N n)} = o (h_{j})

and also the estimate

M_{0} (j, j) ≍ h_{j}

, we bound the denominator by:

\sqrt{M (j, j)} (\sqrt{M_{0} (j, j)} + \sqrt{M (j, j)}) \geq C \sqrt{h_{j} - o (h_{j})} (\sqrt{h_{j}} + \sqrt{h_{j} - o (h_{j})}) \geq C h_{j}

with probability

1 - o (n^{- 3})

, simultaneously for all

1 \leq j \leq p

. Consequently:

| \frac{\sqrt{M_{0} (j, j)}}{\sqrt{M (j, j)}} - 1 | \leq C \sqrt{\frac{log (n)}{N n h_{j}}} \leq C \sqrt{\frac{p log (n)}{N n}},

where the last step is due to

h_{j} \geq h_{min} \geq C / p

. This completes the proof of (A1).

The next Lemma presents the eigen-properties of the population data matrix.

Lemma A2

(Lemmas F.2, F.3, and D.3 in [4]). Suppose the conditions in Theorem 1 hold. Let

G_{0}

be as in (13). Denote by

λ_{1} \geq λ_{1} \geq \dots \geq λ_{K}

the non-zero eigenvalues of

G_{0}

. There exists a constant

C > 1

such that:

C n β_{n} \leq λ_{k} \leq C n, for 2 \leq k \leq K, and λ_{1} \geq C^{- 1} n + max_{2 \leq k \leq K} λ_{K} .

Furthermore, let

ξ_{1}, ξ_{2}, \dots, ξ_{K}

be the associated eigenvectors of

G_{0}

. Then:

\begin{matrix} C^{- 1} \sqrt{h_{j}} \leq ξ_{1} (j) \leq C \sqrt{h_{j}}, ∥ e_{j}^{'} Ξ ∥ \leq C \sqrt{h_{j}} . \end{matrix}

The above lemma can be proved in the same manner as those in the supplement materials of [4]. Given our more general condition on

Σ_{A}

, which allows its smallest eigenvalue to converge to 0 as

n \to \infty

, the results on the eigenvalues are slightly different. In out setting, only the largest eigenvalue is of order n and it is well-separated from the others as the first eigenvector of

n^{- 1} G_{0}

has multiplicity one, which can be claimed by using Perron’s theorem and the last inequality in Assumption 2. For the other eigenvalues, they might be at the order of

β_{n}

in Assumption 2. The details are very similar to those in the supplement materials of [4] by adapting our relaxed condition on

Σ_{A}

, so we avoid redundant derivations here.

Throughout the analysis, we need matrix Bernstein inequality and decoupling inequality for U-statistics. For readers’ convenience, we provide the theorems below.

Theorem A1.

Let

X_{1}, \dots, X_{N}

be independent, mean zero,

n \times n

symmetric random matrices, such that

∥ X_{i} ∥ \leq b

almost surely for all i and

∥ \sum_{i = 1}^{N} E X_{i}^{2} ∥ \leq σ^{2}

. Then, for every

t \geq 0

, we have:

\begin{matrix} P (∥ \sum_{i = 1}^{N} X_{i} ∥ \geq t) \leq 2 n exp (- \frac{t^{2} / 2}{σ^{2} + b t / 3}) . \end{matrix}

The following two theorems are special cases of Theorem 3.4.1 in [29], which implies that using decoupling inequality simplifies the analysis of U-statistics to the study of sums of (conditionally) independent random variables.

Theorem A2.

Let

{X_{i}}_{i = 1}^{n}

be a sequence of i.i.d. random vectors in

R^{d}

, and let

{{\tilde{X}}_{i}}_{i = 1}^{n}

be an independent copy of

{X_{i}}_{i = 1}^{n}

. Then, there exists a constant

\tilde{C} > 0

independent of

n, d

such that:

\begin{matrix} P (| \sum_{i \neq j} X_{i}^{'} X_{j} | \geq t) \leq \tilde{C} P (\tilde{C} | \sum_{i \neq j} X_{i}^{'} {\tilde{X}}_{j} | \geq t) \end{matrix}

Theorem A3.

Let

{X_{m}^{(i)}}_{i, m}

, for

1 \leq i \leq n

and

1 \leq m \leq N

, be a sequence of i.i.d. random vectors in

R^{d}

, and let

{{\tilde{X}}_{m}^{(i)}}_{i, m}

be an independent copy of

{X_{m}^{(i)}}_{i, m}

. Suppose that

h : R^{2 d} \to R

is a measurable function. Then, there exists a constant

\bar{C} > 0

independent of

n, m, d

such that:

\begin{matrix} P (| \sum_{i} \sum_{m \neq m_{1}} h (X_{m}^{(i)}, X_{m_{1}}^{(i)}) | \geq t) \leq \bar{C} P (\bar{C} | \sum_{i} \sum_{m \neq m_{1}} h (X_{m}^{(i)}, {\tilde{X}}_{m_{1}}^{(i)}) | \geq t) \end{matrix}

The key difference between the above theorems is attributed to the index set used across the sum. In Theorem A2, the random variables are indexed by i and all pairs of

(X_{i}, X_{j})

are included; in contrast, Theorem A3 uses both i and m and consider only the pairs that share the identical index i. However, both are viewed as special cases of Theorem 3.4.1 with degree 2 in [29], which discussed a broader sequence of functions

{h_{i j} (\cdot, \cdot)}_{i, j}

, where each

h_{i j} (\cdot, \cdot)

can differ with varying

i, j

. By assigning all

h_{i j} (\cdot, \cdot)

to the same product function, we have Theorem A2; whereas Theorem A3 follows from specifying:

h_{(i m) (j m_{1})} (\cdot, \cdot) = \{\begin{matrix} h (\cdot, \cdot), & i f i = j; \\ 0, & o t h e r w i s e . \end{matrix}

Appendix B. Proofs of Lemmas 1 and 2

Appendix B.1. Proof of Lemma 1

Using the definition of eigenvectors and eigenvalues, we have

G \hat{Ξ} = \hat{Ξ} \hat{Λ}

and

G_{0} Ξ = Ξ Λ

. Additionally, since

G_{0}

has a rank K,

G_{0} = Ξ Λ Ξ^{'}

. It follows that:

\hat{Ξ} \hat{Λ} = [G_{0} + (G - G_{0})] \hat{Ξ} = Ξ Λ Ξ^{'} \hat{Ξ} + (G - G_{0}) \hat{Ξ} = Ξ Ξ^{'} G_{0} \hat{Ξ} + (G - G_{0}) \hat{Ξ} .

As a result:

e_{j}^{'} \hat{Ξ} = e_{j}^{'} Ξ Ξ^{'} G_{0} \hat{Ξ} {\hat{Λ}}^{- 1} + e_{j}^{'} (G - G_{0}) \hat{Ξ} {\hat{Λ}}^{- 1} .

(A2)

Note that

G_{0} \hat{Ξ} = G \hat{Ξ} + (G_{0} - G) \hat{Ξ} = \hat{Ξ} \hat{Λ} + (G_{0} - G) \hat{Ξ}

. We plug this equality into the first term on the RHS of (A2) to obtain:

\begin{matrix} e_{j}^{'} Ξ Ξ^{'} G_{0} \hat{Ξ} {\hat{Λ}}^{- 1} & = e_{j}^{'} Ξ Ξ^{'} \hat{Ξ} + e_{j}^{'} Ξ Ξ^{'} (G_{0} - G) \hat{Ξ} {\hat{Λ}}^{- 1} \\ = e_{j}^{'} Ξ O^{'} + e_{j}^{'} Ξ (Ξ^{'} \hat{Ξ} - O^{'}) + e_{j}^{'} Ξ Ξ^{'} (G_{0} - G) \hat{Ξ} {\hat{Λ}}^{- 1}, \end{matrix}

for any orthogonal matrix O. Combining this with (A2) gives:

∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq ∥ e_{j}^{'} Ξ (Ξ^{'} \hat{Ξ} - O^{'}) ∥ + ∥ e_{j}^{'} Ξ Ξ^{'} (G_{0} - G) \hat{Ξ} {\hat{Λ}}^{- 1} ∥ + ∥ e_{j}^{'} (G - G_{0}) \hat{Ξ} {\hat{Λ}}^{- 1} ∥ .

(A3)

Fix

O = sgn ({\hat{Ξ}}^{'} Ξ)

. The sine-theta theorem [18] yields:

∥ Ξ^{'} \hat{Ξ} - O^{'} ∥ \leq | {\hat{λ}}_{K} |^{- 2} {∥ G - G_{0} ∥}^{2} .

(A4)

We use (A4) to bound the first two terms on the RHS of (A3):

\begin{matrix} ∥ e_{j}^{'} Ξ (Ξ^{'} \hat{Ξ} - O^{'}) ∥ & \leq ∥ e_{j}^{'} Ξ ∥ ∥ Ξ^{'} \hat{Ξ} - O^{'} ∥ \leq ∥ e_{j}^{'} Ξ ∥ \cdot | {\hat{λ}}_{K} |^{- 2} {∥ G - G_{0} ∥}^{2}, \\ ∥ e_{j}^{'} Ξ Ξ^{'} (G_{0} - G) \hat{Ξ} {\hat{Λ}}^{- 1} ∥ & \leq ∥ e_{j}^{'} Ξ ∥ \cdot | {\hat{λ}}_{K} |^{- 1} ∥ Ξ^{'} (G_{0} - G) \hat{Ξ} ∥ \leq ∥ e_{j}^{'} Ξ ∥ \cdot | {\hat{λ}}_{K} |^{- 1} ∥ G - G_{0} ∥ . \end{matrix}

Since

∥ G - G_{0} ∥ \leq (1 - c_{0}) | {\hat{λ}}_{K} |

, the RHS in the second line above dominates the RHS in the first line. We plug these upper bounds into (A3) to get:

\begin{matrix} ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq | {\hat{λ}}_{K} |^{- 1} ∥ e_{j}^{'} Ξ ∥ ∥ G - G_{0} ∥ + ∥ e_{j}^{'} (G - G_{0}) \hat{Ξ} {\hat{Λ}}^{- 1} ∥ \\ \leq | {\hat{λ}}_{K} |^{- 1} (∥ e_{j}^{'} Ξ ∥ ∥ G - G_{0} ∥ + ∥ e_{j}^{'} (G - G_{0}) \hat{Ξ} ∥) . \end{matrix}

(A5)

We notice that the second term on the RHS of (A5) still involves

\hat{Ξ}

, and we further bound this term. By the assumption of this theorem, there exists a diagonal matrix

Γ

such that

∥ Γ^{- 1} (G - G_{0}) {Γ ∥}_{1 \to \infty} \leq (1 - c_{0}) | {\hat{λ}}_{K} |

. It implies:

∥ e_{j}^{'} (G - G_{0}) {Γ ∥}_{1} \leq (1 - c_{0}) γ_{j} | {\hat{λ}}_{K} | .

Additionally, for any vector

v \in R^{p}

and matrix

B \in R^{p \times K}

, it holds that

∥ v^{'} B ∥ \leq \sum_{j} | v_{j} | ∥ e_{j}^{'} B ∥

\leq \sum_{j} | v_{j} {| ∥ B ∥}_{2 \to \infty} \leq {∥ v ∥}_{1} {∥ B ∥}_{2 \to \infty}

. We then bound the second term on the RHS of (A5) as follows:

\begin{matrix} ∥ e_{j}^{'} (G - G_{0}) \hat{Ξ} ∥ \leq ∥ e_{j}^{'} (G - G_{0}) Ξ O^{'} ∥ + ∥ e_{j}^{'} (G - G_{0}) (\hat{Ξ} - Ξ O^{'}) ∥ \\ \leq ∥ e_{j}^{'} (G - G_{0}) Ξ ∥ + ∥ e_{j}^{'} (G - G_{0}) {Γ ∥}_{1} \cdot {∥ Γ^{- 1} (\hat{Ξ} - Ξ O^{'}) ∥}_{2 \to \infty} \\ \leq ∥ e_{j}^{'} (G - G_{0}) Ξ ∥ + (1 - c_{0}) γ_{j} | {\hat{λ}}_{K} | \cdot ∥ Γ^{- 1} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty} . \end{matrix}

(A6)

Plugging (A6) into (A5) gives:

\begin{matrix} ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq | {\hat{λ}}_{K} |^{- 1} (∥ e_{j}^{'} Ξ ∥ ∥ G - G_{0} ∥ + ∥ e_{j}^{'} (G - G_{0}) Ξ ∥) \\ + (1 - c_{0}) γ_{j} \cdot {∥ Γ^{- 1} (\hat{Ξ} - Ξ O^{'}) ∥}_{2 \to \infty} \\ \leq | {\hat{λ}}_{K} |^{- 1} γ_{j} + (1 - c_{0}) γ_{j} \cdot {∥ Γ^{- 1} (\hat{Ξ} - Ξ O^{'}) ∥}_{2 \to \infty}, \end{matrix}

(A7)

where in the last line we have used the assumption that

γ_{j}

is an upper bound for

∥ e_{j}^{'} Ξ ∥ ∥ G - G_{0} ∥ + ∥ e_{j}^{'} (G - G_{0}) Ξ ∥

. Note that

∥ Γ^{- 1} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty} = {max}_{1 \leq j \leq p} \{γ_{j}^{- 1} ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥\}

. We multiply both LSH and RSH of (A7) by

γ_{j}^{- 1}

and take the maximum over j. It gives:

∥ Γ^{- 1} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty} \leq | {\hat{λ}}_{K} |^{- 1} + (1 - c_{0}) {∥ Γ^{- 1} (\hat{Ξ} - Ξ O^{'}) ∥}_{2 \to \infty},

(A8)

or equivalently,

∥ Γ^{- 1} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty} \leq c_{0}^{- 1} {| {\hat{λ}}_{K} |}^{- 1}

. We further plug this inequality into (A7) to obtain:

∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq | λ_{K} |^{- 1} γ_{j} + (1 - c_{0}) \cdot c_{0}^{- 1} | λ_{K} |^{- 1} γ_{j} \leq c_{0}^{- 1} {| λ_{K} |}^{- 1} γ_{j} .

(A9)

This proves the claim. □

Appendix B.2. Proof of Lemma 2

The first claim is the same as the one in Lemma A3 and will be proved there.

The second claim follows by simply collecting arguments in the proof of Lemma A3, as shown below: By (A24),

G - G_{0} = E_{1} + E_{2} + E_{3} + E_{4}

. It follows that:

∥ e_{j}^{'} (G - G_{0}) Ξ ∥ \leq \sum_{s = 1}^{4} ∥ e_{j}^{'} E_{s} Ξ ∥ .

(A10)

We apply Lemma A5 to get large-deviation bounds for

∥ e_{j}^{'} E_{s} Ξ ∥

with

s \in {2, 3, 4}

. This lemma concerns

∥ e_{j}^{'} E_{s} \hat{Ξ} ∥

, but in its proof we have already analyzed

∥ e_{j}^{'} E_{s} Ξ ∥

. In particular,

∥ e_{j}^{'} E_{2} Ξ ∥

and

∥ e_{j}^{'} E_{3} Ξ ∥

have the same bounds as in (A29), and the bound for

∥ e_{j}^{'} E_{4} Ξ ∥

only has the first term in (A30). In summary:

∥ e_{j}^{'} E_{s} Ξ ∥ \leq C \sqrt{\frac{h_{j} n p log (n)}{N}}, for s \in {2, 3, 4} .

(A11)

It remains to bound

∥ e_{j}^{'} E_{1} Ξ ∥

. We first mimic the steps of proving (A33) of Lemma A5 (more specifically, the derivation of (A63), except that

\hat{Ξ}

is replaced by

Ξ

) to obtain:

\begin{matrix} ∥ e_{j} E_{1} Ξ ∥ & \leq C n ∥ e_{j}^{'} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) Ξ ∥ + C ∥ e_{j}^{'} G_{0} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) Ξ ∥ \\ + \sum_{s = 2}^{4} ∥ e_{j}^{'} E_{s} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) Ξ ∥ . \end{matrix}

(A12)

We note that:

\begin{matrix} ∥ e_{j}^{'} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) Ξ ∥ & \leq ∥ M_{0}^{1 / 2} M^{- 1 / 2} - I_{p} ∥ \cdot ∥ e_{j}^{'} Ξ ∥, \\ ∥ e_{j}^{'} G_{0} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) Ξ ∥ & = ∥ e_{j}^{'} Ξ Λ Ξ^{'} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) Ξ ∥ \\ \leq ∥ e_{j}^{'} Ξ ∥ \cdot ∥ Λ ∥ \cdot ∥ M_{0}^{1 / 2} M^{- 1 / 2} - I_{p} ∥, \\ ∥ e_{j}^{'} E_{s} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) Ξ ∥ & \leq ∥ e_{j}^{'} E_{s} ∥ \cdot ∥ M_{0}^{1 / 2} M^{- 1 / 2} - I_{p} ∥ . \end{matrix}

For

s \in {2, 3}

, we have

∥ e_{j}^{'} E_{s} ∥ \leq C \sqrt{h_{j} p log (n) / (N n)}

. This has been derived in the proof of Lemma A5: when controlling

∥ e_{j}^{'} E_{2} Ξ ∥

and

∥ e_{j}^{'} E_{3} Ξ ∥

there, we first bound them by

∥ e_{j}^{'} E_{2} ∥

and

∥ e_{j}^{'} E_{3} ∥

, respectively, and then study

∥ e_{j}^{'} E_{2} ∥

and

∥ e_{j}^{'} E_{3} ∥

directly). We plug these results into (A12) to obtain:

\begin{matrix} ∥ e_{j} E_{1} Ξ ∥ \leq ∥ M_{0}^{1 / 2} M^{- 1 / 2} - I_{p} ∥ (n ∥ e_{j}^{'} Ξ ∥ + | λ_{1} | ∥ e_{j}^{'} Ξ ∥ + C \sqrt{\frac{h_{j} n p log (n)}{N}}) \\ + ∥ e_{j}^{'} E_{4} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) Ξ ∥ . \end{matrix}

(A13)

For

∥ e_{j}^{'} E_{4} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) Ξ ∥

, we cannot use the same idea to bound it as for

s \in {2, 3}

, because the bound for

∥ e_{j}^{'} E_{4} ∥

is much larger than those for

∥ e_{j}^{'} E_{2} ∥

and

∥ e_{j}^{'} E_{4} ∥

. Instead, we study

∥ e_{j}^{'} E_{4} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) Ξ ∥

directly. This part is contained in the proof of Lemma A6; specifically, in the proof of (A31). There we have shown:

∥ e_{j}^{'} E_{4} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) Ξ ∥ \leq C \sqrt{h_{j}} \cdot \frac{p log (n)}{N} .

(A14)

We plug (A14) into (A13) and note that

λ_{1} = O (n)

and

∥ e_{j}^{'} Ξ ∥ = O (h_{j}^{1 / 2})

(by Lemma A2). We also use the assumption that

N n \geq N n β_{n}^{2} \geq p {log}^{2} (n)

and the bound for

∥ M_{0}^{1 / 2} M^{- 1 / 2} - I_{p} ∥

in (A1). It follows that

\begin{matrix} ∥ e_{j} E_{1} Ξ ∥ \leq ∥ M_{0}^{1 / 2} M^{- 1 / 2} - I_{p} ∥ \cdot C \sqrt{h_{j}} (n + \sqrt{\frac{n p log (n)}{N}} + \frac{p log (n)}{N}) \\ \leq ∥ M_{0}^{1 / 2} M^{- 1 / 2} - I_{p} ∥ \cdot O (n h_{j}^{1 / 2}) \leq C \sqrt{\frac{h_{j} n p log (n)}{N}} . \end{matrix}

(A15)

We plug (A11) and (A15) into (A10). This proves the second claim. □

Appendix C. The Complete Proof of Theorem 1

A proof sketch of Theorem 1 has been given in Section 4.4. For the ease of writing formal proofs, we have re-arranged the claims and analyses in Lemmas 1 and 2, so the proof structure here is slightly different from the sketch in Section 4.4. For example, Lemma A3 combines the claims of Lemma 2 with some steps in proving Lemma 1; the remaining steps in the proof of Lemma 1 are combined into the proof of the main theorem.

First, we present a key technical lemma. The proof of this lemma is quite involved and relegated to Appendix D.1.

Lemma A3.

Under the setting of Theorem 1. Recall

G, G_{0}

in (13). With probability

1 - o (n^{- 3})

:

\begin{matrix} (A16) & ∥ G - G_{0} ∥ \leq C \sqrt{\frac{p n log (n)}{N}} ≪ n β_{n}; \\ (A17) & ∥ e_{j}^{'} (G - G_{0}) \hat{Ξ} ∥ / n \leq C \sqrt{\frac{h_{j} p log (n)}{n N}} (1 + ∥ H^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty}) + o (β_{n}) \cdot ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥, \end{matrix}

simultaneously for all

1 \leq j \leq p

.

Next, we use Lemma A3 to prove Theorem 1. Let

({\hat{λ}}_{k}, {\hat{ξ}}_{k})

and

({\hat{λ}}_{k}, {\hat{ξ}}_{k})

be the k-th eigen-pairs of G and

G_{0}

, respectively. Let

\hat{Λ} = diag ({\hat{λ}}_{1}, {\hat{λ}}_{2}, \dots, {\hat{λ}}_{K})

and

Λ = diag (λ_{1}, λ_{2}, \dots, λ_{K})

. Following (A2) and (A3), we have:

\begin{matrix} ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq ∥ e_{j}^{'} Ξ (Ξ^{'} \hat{Ξ} - O^{'}) ∥ + ∥ e_{j}^{'} Ξ Ξ^{'} (G_{0} - G) \hat{Ξ} {\hat{Λ}}^{- 1} ∥ + ∥ e_{j}^{'} (G - G_{0}) \hat{Ξ} {\hat{Λ}}^{- 1} ∥ . \end{matrix}

(A18)

In the sequel, we bound the three terms on the RHS above one-by-one.

First, by sine-theta theorem:

\begin{matrix} ∥ e_{j}^{'} Ξ (Ξ^{'} \hat{Ξ} - O^{'}) ∥ \leq C ∥ e_{j}^{'} Ξ ∥ \frac{∥ G - G_{0} ∥^{2}}{| {\hat{λ}}_{K} - λ_{K + 1} |^{2}} . \end{matrix}

For

1 \leq k \leq p

, by Weyl’s inequality:

\begin{matrix} | {\hat{λ}}_{k} - λ_{k} | \leq ∥ G - G_{0} ∥ ≪ n β_{n} \end{matrix}

(A19)

with probability

1 - o (n^{- 3})

, by employing (A16) in Lemma A3. In particular,

λ_{1} ≍ n

and

C n β_{n} < λ_{k} \leq C n

for

2 \leq k \leq K

and

λ_{k} = 0

otherwise (see Lemma A2). Thereby,

| {\hat{λ}}_{K} - λ_{K + 1} | \geq C n β_{n}

. Further using

∥ e_{j}^{'} Ξ ∥ \leq C \sqrt{h_{j}}

(see Lemma A2), with the aid of Lemma A3, we obtain that with probability

1 - o (n^{- 3})

:

\begin{matrix} ∥ e_{j}^{'} Ξ (Ξ^{'} \hat{Ξ} - O^{'}) ∥ \leq C \sqrt{h_{j}} \cdot \frac{p log (n)}{N n β_{n}^{2}} \end{matrix}

(A20)

simultaneously for all

1 \leq j \leq p

.

Next, we similarly bound the second term:

\begin{matrix} ∥ e_{j}^{'} Ξ Ξ^{'} (G_{0} - G) \hat{Ξ} {\hat{Λ}}^{- 1} ∥ \leq \frac{C}{n β_{n}} ∥ e_{j}^{'} Ξ ∥ ∥ G - G_{0} ∥ \leq C \sqrt{\frac{h_{j} p log (n)}{N n β_{n}^{2}}} . \end{matrix}

(A21)

Here we used the fact that

{\hat{λ}}_{K} \geq C n β_{n}

following from (A19) and Lemma A2.

For the last term, we simply bound:

\begin{matrix} ∥ e_{j}^{'} (G - G_{0}) \hat{Ξ} {\hat{Λ}}^{- 1} ∥ \leq C ∥ e_{j}^{'} (G - G_{0}) \hat{Ξ} ∥ / (n β_{n}) . \end{matrix}

(A22)

Combining (A20), (A21), and (A22) into (A18), by (A17) in Lemma A3, we arrive at:

\begin{matrix} ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq C \sqrt{\frac{h_{j} p log (n)}{N n β_{n}^{2}}} (1 + ∥ H^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty}) + o (1) \cdot ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ . \end{matrix}

Rearranging both sides above gives:

\begin{matrix} ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \leq C \sqrt{\frac{h_{j} p log (n)}{N n β_{n}^{2}}} (1 + ∥ H^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty}), \end{matrix}

(A23)

with probability

1 - o (n^{- 3})

, simultaneously for all

1 \leq j \leq p

.

To proceed, we multiply both sides in (A23) by

h_{j}^{- 1 / 2}

and take the maximum. It follows that:

\begin{matrix} ∥ H^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty} \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} (1 + ∥ H_{0}^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty}) . \end{matrix}

Note that

\sqrt{p log (n)} / \sqrt{N n β_{n}^{2}} = o (1)

from Assumption 3. We further rearrange both sides above and get:

∥ H^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty} \leq \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} = o (1) .

Plugging the above estimate into (A23), we finally conclude the proof of Theorem 1.□

Appendix D. Entry-Wise Eigenvector Analysis and Proof of Lemma A3

To finalize the proof of Theorem 1 as outlined in Appendix C, the remaining task is to prove Lemma A3.

Recall the definition in (13) that:

G = M^{- \frac{1}{2}} D D^{'} M^{- \frac{1}{2}} - \frac{n}{N} I_{p}, G_{0} = M_{0}^{- \frac{1}{2}} [\sum_{i = 1}^{n} (1 - N_{i}^{- 1}) d_{i}^{0} {(d_{i}^{0})}^{'}] M_{0}^{- \frac{1}{2}} .

Write

D = D_{0} + Z

, where

Z = (z_{1}, z_{2}, \dots, z_{n})

is a mean-zero random matrix with each

N z_{i}

being centered Multinomial

(N_{i}, A w_{i})

. By this representation, we decompose the perturbation matrix

G - G_{0}

as follows:

\begin{matrix} G - G_{0} & = M^{- \frac{1}{2}} D D^{'} M^{- \frac{1}{2}} - M_{0}^{- \frac{1}{2}} D D^{'} M_{0}^{- \frac{1}{2}} + M_{0}^{- \frac{1}{2}} (D D^{'} - \sum_{i = 1}^{n} (1 - N_{i}^{- 1}) d_{i}^{0} {(d_{i}^{0})}^{'} - \frac{n}{N} M_{0}) M_{0}^{- \frac{1}{2}} \\ = (M^{- \frac{1}{2}} D D^{'} M^{- \frac{1}{2}} - M_{0}^{- \frac{1}{2}} D D^{'} M_{0}^{- \frac{1}{2}}) + M_{0}^{- \frac{1}{2}} Z D_{0}^{'} M_{0}^{- \frac{1}{2}} + M_{0}^{- \frac{1}{2}} D_{0} Z^{'} M_{0}^{- \frac{1}{2}} \\ + M_{0}^{- \frac{1}{2}} (Z Z^{'} - E Z Z^{'}) M_{0}^{- \frac{1}{2}} \\ = E_{1} + E_{2} + E_{3} + E_{4}, \end{matrix}

(A24)

where:

\begin{matrix} E_{1} : = M^{- \frac{1}{2}} D D^{'} M^{- \frac{1}{2}} - M_{0}^{- \frac{1}{2}} D D^{'} M_{0}^{- \frac{1}{2}}, \\ E_{2} : = M_{0}^{- \frac{1}{2}} Z D_{0}^{'} M_{0}^{- \frac{1}{2}}, E_{3} : = M_{0}^{- \frac{1}{2}} D_{0} Z^{'} M_{0}^{- \frac{1}{2}} \\ E_{4} : = M_{0}^{- \frac{1}{2}} (Z Z^{'} - E Z Z^{'}) M_{0}^{- \frac{1}{2}} . \end{matrix}

(A25)

Here the second step of (A24) is due to the identity:

\begin{matrix} E (Z Z^{'}) + \sum_{i = 1}^{n} N_{i}^{- 1} d_{i}^{0} {(d_{i}^{0})}^{'} - \frac{n}{N} M_{0} = 0, \end{matrix}

which can be obtained by:

E (Z Z^{'}) = \sum_{i = 1}^{n} E z_{i} z_{i}^{'} = \sum_{i = 1}^{n} N_{i}^{- 2} \sum_{m, s = 1}^{N_{i}} E (T_{i m} - E T_{i m}) {(T_{i s} - E T_{i s})}^{'},

with

{T_{i m}}_{m = 1}^{N}

being i.i.d. Multinomial

(1, A w_{i})

.

Throughout the analysis in this section, we will frequently rewrite and use:

\begin{matrix} z_{i} = \frac{1}{N_{i}} \sum_{m = 1}^{N_{i}} T_{i m} - E T_{i m} \end{matrix}

(A26)

as it introduces the sum of independent random variables. We use the notation

d_{i}^{0} : = E d_{i} = E T_{i m} = A w_{i}

for simplicity.

By (A24), in order to prove Lemma A3, it suffices to study:

∥ E_{s} ∥ and ∥ e_{j}^{'} E_{s} \hat{Ξ} ∥ / n, for s = 1, 2, 3, 4 and 1 \leq j \leq p .

The estimates for the aforementioned quantities are provided in the following technical lemmas, whose proofs are deferred to later sections.

Lemma A4.

Suppose the conditions in Theorem 1 hold. There exists a constant

C > 0

, such that with probability

1 - o (n^{- 3})

:

\begin{matrix} (A27) & ∥ E_{s} ∥ \leq C \sqrt{\frac{p n log (n)}{N}}, for s = 1, 2, 3 \\ (A28) & ∥ E_{4} ∥ = ∥ M_{0}^{- \frac{1}{2}} (Z Z^{'} - E Z Z^{'}) M_{0}^{- \frac{1}{2}} ∥ \leq C max \{\sqrt{\frac{p n log (n)}{N^{2}}}, \frac{p log (n)}{N}\} . \end{matrix}

Lemma A5.

Suppose the conditions in Theorem 1 hold. There exists a constant

C > 0

, such that with probability

1 - o (n^{- 3})

, simultaneously for all

1 \leq j \leq p

:

\begin{matrix} (A29) & ∥ e_{j}^{'} E_{s} \hat{Ξ} ∥ / n \leq C \sqrt{\frac{h_{j} p log (n)}{N n}}, for s = 2, 3 \\ (A30) & ∥ e_{j}^{'} E_{4} \hat{Ξ} ∥ / n \leq C \sqrt{\frac{h_{j} p log (n)}{N n}} (1 + ∥ H_{0}^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty}), \end{matrix}

with

O = sgn ({\hat{Ξ}}^{'} Ξ)

.

Lemma A6.

Suppose the conditions in Theorem 1 hold. There exists a constant

C > 0

, such that with probability

1 - o (n^{- 3})

, simultaneously for all

1 \leq j \leq p

:

\begin{matrix} (A31) & ∥ e_{j}^{'} E_{4} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / n \leq C \sqrt{h_{j}} \cdot \frac{p log (n)}{n N} (1 + ∥ H^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty}), \\ (A32) & ∥ e_{j}^{'} (M^{1 / 2} M_{0}^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ \leq C \sqrt{\frac{log (n)}{N n}} + o (β_{n}) \cdot ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥; \end{matrix}

and furthermore:

\begin{matrix} ∥ e_{j}^{'} E_{1} \hat{Ξ} ∥ / n \leq C \sqrt{\frac{h_{j} p log (n)}{N n}} (1 + ∥ H_{0}^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty}) + o (β_{n}) \cdot ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ . \end{matrix}

(A33)

For proving Lemmas A4 and A5, the difficulty lies in showing (A28) and (A30) as the quantity

E_{4}

involves the quadratic terms of Z with its dependence on

\hat{Ξ}

. We overcome the hurdle by decomposing

\hat{Ξ} = Ξ + \hat{Ξ} - Ξ O^{'}

and employing decoupling techniques (Theorems A2 and A3). Considering the expression of

E_{1}

, where

D D^{'}

is involved, the proof of (A33) in Lemma A6 significantly rely on the estimates in Lemma A5, together with (A31) and (A32). The detailed proofs are systematically presented in subsequent sections, following the proof of Lemma A3.

Appendix D.1. Proof of Lemma A3

We employ the technical lemmas (Lemmas A4–A6) to prove Lemma A3. We start with (A16). By the representation (A24), it is straightforward to obtain that:

∥ G - G_{0} ∥ \leq \sum_{s = 1}^{4} ∥ E_{s} ∥ \leq C \sqrt{\frac{p n log (n)}{N}} + C max \{\sqrt{\frac{p n log (n)}{N^{2}}}, \frac{p log (n)}{N}\}

for some constant

C > 0

, with probability

1 - o (n^{- 3})

. Under Assumption 3, it follows that:

\sqrt{\frac{p n log (n)}{N^{2}}} ≪ \sqrt{\frac{p n log (n)}{N}}, \frac{p log (n)}{N} = \sqrt{\frac{p n log (n)}{N}} \cdot \sqrt{\frac{p log (n)}{N n}} ≪ \sqrt{\frac{p n log (n)}{N}}

and:

\sqrt{\frac{p n log (n)}{N}} = n \cdot \sqrt{\frac{p log (n)}{N n}} ≪ n .

Therefore, we complete the proof of (A16).

Next, we show (A17). Similarly, using (A27), (A30), and (A33), we have:

\begin{matrix} ∥ e_{j}^{'} (G - G_{0}) \hat{Ξ} ∥ / n & \leq \sum_{s = 1}^{4} ∥ e_{j}^{'} E_{s} \hat{Ξ} ∥ / n \\ \leq C \sqrt{\frac{h_{j} p log (n)}{N n}} (1 + ∥ H_{0}^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty}) + o (β_{n}) \cdot ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ . \end{matrix}

This concludes the proof of Lemma A3.□

Appendix D.2. Proof of Lemma A4

We examine each

∥ E_{i} ∥

for

i = 1, 2, 3, 4

. We start with the easy one,

∥ E_{2} ∥

. Recall

D_{0} = A W

. We denote by

W_{k}^{'}

the k-th row of W and rewrite

W = {(W_{1}, \dots, W_{K})}^{'}

. Similarly, we use

Z_{j}^{'}

,

1 \leq j \leq p

to denote j-th row of Z. Thereby,

Z = (z_{1}, z_{2}, \dots, z_{n}) = {(Z_{1}, Z_{2}, \dots, Z_{p})}^{'}

. By the definition that

E_{2} = M_{0}^{- 1 / 2} Z D_{0}^{'} M_{0}^{- 1 / 2}

, we have:

\begin{matrix} ∥ E_{2} ∥ & = ∥ M_{0}^{- 1 / 2} Z W^{'} A^{'} M_{0}^{- 1 / 2} ∥ = ∥ \sum_{k = 1}^{K} M_{0}^{- 1 / 2} Z W_{k} \cdot A_{k}^{'} M_{0}^{- 1 / 2} ∥ \\ \leq \sum_{k = 1}^{K} ∥ M_{0}^{- 1 / 2} Z W_{k} ∥ \cdot ∥ A_{k}^{'} M_{0}^{- 1 / 2} ∥ . \end{matrix}

(A34)

We analyze each factor in the summand:

\begin{matrix} ∥ M_{0}^{- 1 / 2} Z W_{k} ∥^{2} = \sum_{j = 1}^{p} \frac{1}{M_{0} (j, j)} {(Z_{j}^{'} W_{k})}^{2}, ∥ A_{k}^{'} M_{0}^{- 1 / 2} ∥ ≍ ∥ A_{k}^{'} H^{- 1} A_{k} ∥^{1 / 2} \leq C, \end{matrix}

(A35)

where we used the fact that

A_{k} (j) \leq h_{j}

for

1 \leq j \leq p

. Hence, what remains is to prove a high-probability bound for each

Z_{j}^{'} W_{k}

. By the representation (A26):

\begin{matrix} Z_{j}^{'} W_{k} = \sum_{i = 1}^{n} z_{i} (j) w_{i} (k) = \sum_{i = 1}^{n} \sum_{m = 1}^{N_{i}} N_{i}^{- 1} w_{i} (k) (T_{i m} (j) - d_{i}^{0} (j)) . \end{matrix}

We then apply Bernstein inequality to the RHS above. By straightforward computations:

\begin{matrix} var (Z_{j}^{'} W_{k}) & = \sum_{i = 1}^{n} \sum_{m = 1}^{N_{i}} N_{i}^{- 2} w_{i} {(k)}^{2} E {(T_{i m} (j) - d_{i}^{0} (j))}^{2} \\ \leq \sum_{i = 1}^{n} N_{i}^{- 1} w_{i} {(k)}^{2} d_{i}^{0} (j) \leq \frac{h_{j} n}{N}, \end{matrix}

and the individual bound for each summand is

C / N

. Then, one can conclude from Bernstein inequality that with probability

1 - o (n^{- 3 - c_{0}})

:

\begin{matrix} | Z_{j}^{'} W_{k} | \leq C \sqrt{n h_{j} log (n) / N} + log (n) / N . \end{matrix}

(A36)

As a result, considering all

1 \leq j \leq p

, under

p n^{- c_{0}} \leq C

from Assumption 3, we have:

\begin{matrix} ∥ M_{0}^{- \frac{1}{2}} Z W_{k} ∥^{2} \leq C \sum_{j = 1}^{p} h_{j}^{- 1} \cdot (\frac{n h_{j} log (n)}{N} + \frac{log {(n)}^{2}}{N^{2}}) \leq C \frac{n p log (n)}{N} \end{matrix}

(A37)

with probability

1 - o (n^{- 3})

. Here, in the first step, we used

M_{0} (j, j) ≍ h_{j}

; the last step is due to the conditions

h_{j} \geq h_{min} \geq C / p

and

p log (n) ≪ N n

. Plugging (A37) and (A35) into (A34) gives:

\begin{matrix} ∥ E_{2} ∥ \leq C \sqrt{\frac{n p log (n)}{N}} . \end{matrix}

(A38)

Furthermore, by definition,

E_{3} = E_{2}^{'}

and

∥ E_{3} ∥ = ∥ E_{2} ∥

. Therefore, we directly conclude the upper bound for

∥ E_{3} ∥

.

Next, we study

E_{4}

and prove (A28). Notice that

M_{0} (j, j) ≍ h_{j}

for all

1 \leq j \leq p

. It suffices to prove:

\begin{matrix} ∥ H^{- \frac{1}{2}} (Z Z^{'} - E Z Z^{'}) H^{- \frac{1}{2}} ∥ \leq C max \{\sqrt{\frac{p n log (n)}{N^{2}}}, \frac{p log (n)}{N}\} . \end{matrix}

(A39)

We prove (A39) by employing Matrix Bernstein inequality (i.e., Theorem A1) and decoupling techniques (i.e., Theorem A2). First, write:

\begin{matrix} H^{- \frac{1}{2}} (Z Z^{'} - E Z Z^{'}) H^{- \frac{1}{2}} & = \sum_{i = 1}^{n} (H^{- \frac{1}{2}} z_{i}) {(H^{- \frac{1}{2}} z_{i})}^{'} - E (H^{- \frac{1}{2}} z_{i}) {(H^{- \frac{1}{2}} z_{i})}^{'} \\ = : n \cdot \sum_{i = 1}^{n} \frac{1}{n} ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'} - E {\tilde{z}}_{i} {\tilde{z}}_{i}^{'}) \\ = : n \cdot \sum_{i = 1}^{n} X_{i} \end{matrix}

In order to get sharp bound, we employ the truncation idea by introducing:

\begin{matrix} {\tilde{X}}_{i} : = \frac{1}{n} ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}} - E {\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}}), where E_{i} : = {{\tilde{z}}_{i}^{'} {\tilde{z}}_{i} \leq C p / N}, \end{matrix}

for some sufficiently large

C > 0

that depends on

C_{0}

(see Assumption 3) and

1_{E_{i}}

represents the indicator function. We then have:

\begin{matrix} n \sum_{i = 1}^{n} X_{i} = n \sum_{i = 1}^{n} {\tilde{X}}_{i} - \sum_{i = 1}^{n} E ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}^{c}}) \end{matrix}

(A40)

under the event

⋂_{i = 1}^{n} E_{i}

. We will prove the large-deviation bound of

H^{- \frac{1}{2}} (Z Z^{'} - E Z Z^{'}) H^{- \frac{1}{2}}

in the following steps.

(a): We claim that:

$\begin{matrix} P (⋂_{i = 1}^{n} E_{i}) \leq 1 - \sum_{i = 1}^{n} P (E_{i}^{c}) = 1 - o (n^{- (2 C_{0} + 3)}) . \end{matrix}$
(b): We claim that under the event $⋂_{i = 1}^{n} E_{i}$ :

$∥ n \sum_{i = 1}^{n} X_{i} - n \sum_{i = 1}^{n} {\tilde{X}}_{i} ∥ = o (n^{- (C_{0} + 1)}) .$
(c): We aim to derive a high probability bound of $n \sum_{i = 1}^{n} {\tilde{X}}_{i}$ by Matrix Bernstein inequality (i.e., Theorem A1). We show that with probability $1 - o (n^{- 3})$ , for some large $C > 0$ :

$\begin{matrix} ∥ \sum_{i = 1}^{n} {\tilde{X}}_{i} ∥ \leq C max \{\sqrt{\frac{p log (n)}{n N^{2}}}, \frac{p log (n)}{n N}\} . \end{matrix}$

If (a)–(c) are claimed, with the condition that

N < C n^{- C_{0}}

from Assumption 3, it is straightforward to conclude that:

\begin{matrix} ∥ H^{- \frac{1}{2}} (Z Z^{'} - E Z Z^{'}) H^{- \frac{1}{2}} ∥ & = n ∥ \sum_{i = 1}^{n} {\tilde{X}}_{i} ∥ + o (n^{- C_{0}}) \\ \leq C max \{\sqrt{\frac{p n log (n)}{N^{2}}}, \frac{p log (n)}{N}\}, \end{matrix}

with probability

1 - o (n^{- 3})

. This gives (A28), except that we still need to verify (a)–(c).

In the sequel, we prove (a), (b) and (c) separately. To prove (a), it suffices to show that

P (E_{i}^{c}) = o (n^{- (2 C_{0} + 4)})

for all

1 \leq i \leq n

. By definition, for any fixed i,

N_{i} z_{i}

is centered multinomial with

N_{i}

trials. Therefore, we can represent:

\begin{matrix} z_{i} = \frac{1}{N_{i}} \sum_{m = 1}^{N_{i}} (T_{i m} - E T_{i m}), where T_{i m} ’ s are i . i . d . multinomial (1, d_{i}^{0}) for fixed i, \end{matrix}

(A41)

Then it can be computed that:

\begin{matrix} E ({\tilde{z}}_{i}^{'} {\tilde{z}}_{i}) = E z_{i}^{'} H^{- 1} z_{i} & = \frac{1}{N_{i}^{2}} \sum_{m = 1}^{N_{i}} E {(T_{i m} - E T_{i m})}^{'} H^{- 1} (T_{i m} - E T_{i m}) \\ = \frac{1}{N_{i}^{2}} \sum_{m = 1}^{N_{i}} \sum_{t = 1}^{p} E {(T_{i m} (t) - d_{i}^{0} (t))}^{2} h_{t}^{- 1} \\ = \frac{1}{N_{i}^{2}} \sum_{m = 1}^{N_{i}} \sum_{t = 1}^{p} d_{i}^{0} (t) (1 - d_{i}^{0} (t)) h_{t}^{- 1} \leq \frac{p}{N_{i}} . \end{matrix}

(A42)

We write:

\begin{matrix} {\tilde{z}}_{i}^{'} {\tilde{z}}_{i} - E ({\tilde{z}}_{i}^{'} {\tilde{z}}_{i}) = z_{i}^{'} H^{- 1} z_{i} - E z_{i}^{'} H^{- 1} z_{i} = I_{1} + I_{2}, \end{matrix}

(A43)

where:

\begin{matrix} I_{1} : = \frac{1}{N_{i}^{2}} \sum_{m_{1} \neq m_{2}}^{N_{i}} {(T_{i m_{1}} - E T_{i m_{1}})}^{'} H^{- 1} (T_{i m_{2}} - E T_{i m_{2}}), \\ I_{2} : = \frac{1}{N_{i}^{2}} \sum_{m = 1}^{N_{i}} {(T_{i m} - E T_{i m})}^{'} H^{- 1} (T_{i m} - E T_{i m}) - E {(T_{i m} - E T_{i m})}^{'} H^{- 1} (T_{i m} - E T_{i m}) . \end{matrix}

First, we study

I_{1}

. Let

{{\tilde{T}}_{i m}}_{m = 1}^{N}

be an independent copy of

{T_{i m}}_{m = 1}^{N}

and:

\begin{matrix} {\tilde{I}}_{1} : = \frac{1}{N_{i}^{2}} \sum_{m_{1} \neq m_{2}}^{N_{i}} {(T_{i m_{1}} - E T_{i m_{1}})}^{'} H^{- 1} ({\tilde{T}}_{i m_{2}} - E {\tilde{T}}_{i m_{2}}) . \end{matrix}

We apply Theorem A2 to

I_{1}

and get:

\begin{matrix} P (| I_{1} | > t) \leq C P ({\tilde{I}}_{1} > C^{- 1} t) . \end{matrix}

(A44)

It suffices to obtain the large-deviation of

{\tilde{I}}_{1}

instead. Rewrite:

\begin{matrix} {\tilde{I}}_{1} & = \frac{1}{N_{i}} \sum_{m_{1}}^{N_{i}} {({\tilde{T}}_{i m_{1}} - E {\tilde{T}}_{i m_{1}})}^{'} H^{- 1 / 2} (\frac{1}{N_{i}} \sum_{m = 1}^{N_{i}} H^{- 1 / 2} (T_{i m} - E T_{i m})) \\ - \frac{1}{N_{i}^{2}} \sum_{m = 1}^{N_{i}} {(T_{i m} - E T_{i m})}^{'} H^{- 1} ({\tilde{T}}_{i m} - E {\tilde{T}}_{i m}) \\ = : T_{1} + T_{2} . \end{matrix}

(A45)

We derive the high-probability bound for

T_{1}

first. For simplicity, write:

a = H^{- 1 / 2} (\frac{1}{N_{i}} \sum_{m = 1}^{N_{i}} (T_{i m} - E T_{i m})) .

Then,

T_{1} = N_{i}^{- 1} \sum_{m = 1}^{N_{i}} {({\tilde{T}}_{i m} - E {\tilde{T}}_{i m})}^{'} H^{- 1 / 2} a

. We apply Bernstein inequality condition on

{T_{i m}}_{m = 1}^{N_{i}}

. By elementary computations:

\begin{matrix} var (T_{1} | {T_{i m}}_{m = 1}^{N_{i}}) & = \frac{1}{N_{i}^{2}} \sum_{m = 1}^{N_{i}} E [{({({\tilde{T}}_{i m} - E {\tilde{T}}_{i m})}^{'} H^{- 1 / 2} a)}^{2} | a] \\ = \frac{1}{N_{i}} \sum_{j = 1}^{p} d_{i}^{0} (j) {(a (j) / h_{j}^{1 / 2} - {(d_{i}^{0})}^{'} H^{- 1 / 2} a)}^{2} \\ = \frac{1}{N_{i}} \sum_{j = 1}^{p} \frac{d_{i}^{0} (j)}{h_{j}} a^{2} (j) - \frac{1}{N_{i}} {[{(d_{i}^{0})}^{'} H^{- 1 / 2} a]}^{2} \\ \leq {∥ a ∥}^{2} / N_{i}, \end{matrix}

where we used that fact

d_{i}^{0} (j) = e_{j}^{'} A w_{i} \leq e_{j}^{'} A 1_{K} = h_{j}

. Furthermore, with the individual bound

N^{- 1} {max}_{t} {a (t) / \sqrt{h_{t}}}

, we obtain from Bernstein inequality that with probability

1 - o (n^{- (2 C_{0} + 4)})

:

\begin{matrix} | T_{1} | \leq C (\sqrt{\frac{log (n)}{N}} ∥ a ∥ + \frac{1}{N} max_{t} \frac{| a (t) |}{\sqrt{h_{t}}} log (n)), \end{matrix}

by choosing appropriately large

C > 0

. We then consider using Bernstein inequality to study

a (t)

and get:

| a (t) | \leq C \sqrt{\frac{log (n)}{N}} + C \frac{log (n)}{N \sqrt{h_{min}}}

simultaneously for all

1 \leq t \leq p

, with probability

1 - o (n^{- (2 C_{0} + 4)})

. As a result, under the condition

min {p, N} \geq C_{0} log (n)

from Assumption 3, it holds that:

\begin{matrix} | T_{1} | & \leq C (\sqrt{\frac{log (n)}{N}} ∥ a ∥ + \frac{1}{N} max_{t} \frac{| a (t) |}{\sqrt{h_{t}}} log (n)) \\ \leq C (\sqrt{\frac{p log (n)}{N}} [\sqrt{\frac{log (n)}{N}} + C \frac{log (n)}{N \sqrt{h_{min}}}] + \frac{p}{N}) \\ \leq C \frac{p}{N} . \end{matrix}

(A46)

We then proceed to the second term in (A45),

T_{2} = N_{i}^{- 2} \sum_{m = 1}^{N_{i}} {(T_{i m} - E T_{i m})}^{'} H^{- 1} ({\tilde{T}}_{i m} - E {\tilde{T}}_{i m})

. Using Bernstein inequality, similarly to the above derivations, we get:

\begin{matrix} var (T_{2}) & = N_{i}^{- 4} \sum_{m = 1}^{N_{i}} E {({(T_{i m} - E T_{i m})}^{'} H^{- 1} ({\tilde{T}}_{i m} - E {\tilde{T}}_{i m}))}^{2} \\ = N_{i}^{- 4} \sum_{m = 1}^{N_{i}} E [\sum_{j = 1}^{p} \frac{d_{i}^{0} (j)}{h_{j}^{2}} {({\tilde{T}}_{i m} (j) - E {\tilde{T}}_{i m} (j))}^{2} - {({(d_{i}^{0})}^{'} H^{- 1} ({\tilde{T}}_{i m} - E {\tilde{T}}_{i m}))}^{2}] \\ = N_{i}^{- 3} [\sum_{j = 1}^{p} \frac{{(d_{i}^{0} (j))}^{2} (1 - d_{i}^{0} (j))}{h_{j}^{2}} - \sum_{j = 1}^{p} d_{i}^{0} (j) {(\frac{d_{i}^{0} (j)}{h_{j}} - {(d_{i}^{0})}^{'} H^{- 1} d_{i}^{0})}^{2}] \\ = N_{i}^{- 3} [\sum_{j = 1}^{p} \frac{{(d_{i}^{0} (j))}^{2} (1 - 2 d_{i}^{0} (j))}{h_{j}^{2}} + {({(d_{i}^{0})}^{'} H^{- 1} d_{i}^{0})}^{2}] \\ < 2 \frac{p}{N^{3}} . \end{matrix}

The individual bound is given by

N^{- 2} / h_{min}

. If follows from Bernstein inequality that:

\begin{matrix} T_{2} \leq C (\sqrt{\frac{p log (n)}{N^{3}}} + \frac{log (n)}{N^{2} h_{min}}) \end{matrix}

(A47)

with probability

1 - o (n^{- (2 C_{0} + 4)})

. Consequently, by pluging (A46) and (A47) into (A45) and using Assumption 3,

| {\tilde{I}}_{1} | ≲ \frac{p}{N}

(A48)

with probability

1 - o (n^{- (2 C_{0} + 4)})

. By (A44), we get:

\begin{matrix} | I_{1} | \leq C (\sqrt{\frac{log (n)}{N}} ∥ a ∥ + \frac{p}{N}) \end{matrix}

(A49)

with probability

1 - o (n^{- (2 C_{0} + 4)})

.

Second, we prove a similar bound for

I_{2}

with:

I_{2} = \frac{1}{N_{i}^{2}} \sum_{m = 1}^{N_{i}} {(T_{i m} - E T_{i m})}^{'} H^{- 1} (T_{i m} - E T_{i m}) - E {(T_{i m} - E T_{i m})}^{'} H^{- 1} (T_{i m} - E T_{i m}) .

We compute the variance by:

\begin{matrix} var {(T_{i m} - E T_{i m})}^{'} H^{- 1} (T_{i m} - E T_{i m}) \\ = E {(\sum_{t} h_{t}^{- 1} {(T_{i m} (t) - d_{i}^{0} (t))}^{2})}^{2} - {(E \sum_{t} h_{t}^{- 1} {(T_{i m} (t) - d_{i}^{0} (t))}^{2})}^{2} \\ \leq \sum_{t} h_{t}^{- 2} d_{i}^{0} (t) [{(1 - d_{i}^{0} (t))}^{4} + (1 - d_{i}^{0} (t)) d_{i}^{0} {(t)}^{3}] - \sum_{t} h_{t}^{- 2} d_{i}^{0} {(t)}^{2} {(1 - d_{i}^{0} (t))}^{2} \\ \leq \sum_{t} h_{t}^{- 1} ≲ p h_{min}^{- 1} . \end{matrix}

This, together with the crude bound:

| {(T_{i m} - E T_{i m})}^{'} H^{- 1} (T_{i m} - E T_{i m}) - E {(T_{i m} - E T_{i m})}^{'} H^{- 1} (T_{i m} - E T_{i m}) | \leq C h_{min}^{- 1},

gives that with probability

1 - o (n^{- (2 C_{0} + 4)})

, for some sufficiently large

C > 0

:

\begin{matrix} | I_{2} | \leq C max \{\sqrt{\frac{p log (n)}{N^{3} h_{min}}}, \frac{log (n)}{N^{2} h_{min}}\} \leq C \frac{p}{N}, \end{matrix}

(A50)

under Assumption 3. Combing (A49) and (A50), yields that:

\begin{matrix} {\tilde{z}}_{i}^{'} {\tilde{z}}_{i} = z_{i}^{'} H^{- 1} z_{i} \leq E z_{i}^{'} H^{- 1} z_{i} + | I_{1} | + | I_{2} | \leq C \frac{p}{N} \end{matrix}

with probability

1 - o (n^{- (2 C_{0} + 4)})

. Thus, we conclude the claim

P (E_{i}^{c}) = o (n^{- (2 C_{0} + 4)})

for all

1 \leq i \leq n

. The proof of (a) is complete.

Next, we show the proof of (b). Recall the second term on the RHS of (A40). Using the convexity of

∥ \cdot ∥

and the trivial bound:

E | {\tilde{z}}_{i}^{'} {\tilde{z}}_{i} 1_{E_{i}^{c}} | \leq P (E_{i}^{c}) {∥ {\tilde{z}}_{i}^{'} {\tilde{z}}_{i} ∥}_{max} \leq h_{min}^{- 1} P (E_{i}^{c}),

we get:

\begin{matrix} ∥ \sum_{i = 1}^{n} E ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}^{c}}) ∥ \leq \sum_{i = 1}^{n} E ∥ {\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}^{c}} ∥ = \sum_{i = 1}^{n} E | {\tilde{z}}_{i}^{'} {\tilde{z}}_{i} 1_{E_{i}^{c}} | \leq o (n^{- (2 C_{0} + 4)}) n p = o (n^{- (C_{0} + 3)}) . \end{matrix}

Here, in the last step, we used the fact that

p \leq n^{C_{0}}

, which follows from the second condition in Assumption 3. This yields the estimate in (b).

Finally, we claim (c) by Matrix Bernstein inequality (i.e., Theorem A1). Towards that, we need to derive the upper bounds of

∥ {\tilde{X}}_{i} ∥

and

∥ E {\tilde{X}}_{i}^{2} ∥

. By definition of

{\tilde{X}}_{i}

, that is:

{\tilde{X}}_{i} : = \frac{1}{n} ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}} - E {\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}}),

we easily derive that:

\begin{matrix} ∥ {\tilde{X}}_{i} ∥ \leq \frac{1}{n} (| {\tilde{z}}_{i}^{'} {\tilde{z}}_{i} 1_{E_{i}} | + ∥ E ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}}) ∥) \leq \frac{1}{n} (| {\tilde{z}}_{i}^{'} {\tilde{z}}_{i} 1_{E_{i}} | + ∥ E ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}^{c}}) ∥ + ∥ E ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'}) ∥) \leq \frac{C p}{n N} \end{matrix}

for some large

C > 0

, in which we used the estimate:

\begin{matrix} ∥ E ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'}) ∥ = ∥ H^{- 1 / 2} E (z_{i} z_{i}^{'}) H^{- 1 / 2} ∥ & \leq N_{i}^{- 1} ∥ H^{- 1 / 2} (diag (d_{i}^{0}) - d_{i}^{0} {(d_{i}^{0})}^{'}) H^{- 1 / 2} ∥ \\ \leq N_{i}^{- 1} ∥ H^{- 1 / 2} diag (d_{i}^{0}) H^{- 1 / 2} ∥ + N_{i}^{- 1} | {(d_{i}^{0})}^{'} H^{- 1} d_{i}^{0} | \\ \leq \frac{2}{N} . \end{matrix}

By the above inequality, it also holds that:

\begin{matrix} ∥ E ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}}) ∥ \leq ∥ E ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}^{c}}) ∥ + ∥ E ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'}) ∥ \leq \frac{C}{N} . \end{matrix}

Moreover:

\begin{matrix} ∥ E {\tilde{X}}_{i}^{2} ∥ & = ∥ n^{- 2} E (∥ {\tilde{z}}_{i} ∥^{2} {\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}}) - n^{- 2} {(E {\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}})}^{2} ∥ \\ \leq \frac{p}{n^{2} N} ∥ E ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}}) ∥ + \frac{1}{n^{2}} {∥ E ({\tilde{z}}_{i} {\tilde{z}}_{i}^{'} 1_{E_{i}}) ∥}^{2} \\ \leq \frac{C p}{n^{2} N^{2}} . \end{matrix}

Since

E {\tilde{X}}_{i} = 0

, it follows from Theorem A1 that:

\begin{matrix} P (∥ \sum_{i = 1}^{n} {\tilde{X}}_{i} ∥ \geq t) \leq 2 n exp (\frac{- t^{2} / 2}{σ^{2} + b t / 3}), \end{matrix}

with

σ^{2} = C p / (n N^{2})

,

b = C p / (n N)

. As a result:

\begin{matrix} ∥ \sum_{i = 1}^{n} {\tilde{X}}_{i} ∥ \leq C max \{\sqrt{\frac{p log (n)}{n N^{2}}}, \frac{p log (n)}{n N}\} \end{matrix}

with probability

1 - o (n^{- 3})

, for some large

C > 0

. We hence finish the proof of (c). The proof of (A28) is complete now.

Lastly, we prove

∥ E_{1} ∥ \leq C \sqrt{p n log (n)} / \sqrt{N}

. By definition, we rewrite:

\begin{matrix} E_{1} = (M^{- 1 / 2} M_{0}^{1 / 2}) M_{0}^{- 1 / 2} D D^{'} M_{0}^{- 1 / 2} (M^{- 1 / 2} M_{0}^{1 / 2} - I_{p}) \\ + (M^{- 1 / 2} M_{0}^{1 / 2} - I_{p}) M_{0}^{- 1 / 2} D D^{'} M_{0}^{- 1 / 2} . \end{matrix}

(A51)

Decomposing D by

D_{0} + Z

gives rise to:

\begin{matrix} M_{0}^{- \frac{1}{2}} D D^{'} M_{0}^{- \frac{1}{2}} & = M_{0}^{- \frac{1}{2}} \sum_{i = 1}^{n} (1 - N_{i}^{- 1}) d_{i}^{0} {(d_{i}^{0})}^{'} M_{0}^{- \frac{1}{2}} + \frac{n}{N} I_{p} + M_{0}^{- \frac{1}{2}} D_{0} Z^{'} M_{0}^{- \frac{1}{2}} + M_{0}^{- \frac{1}{2}} Z D_{0}^{'} M_{0}^{- \frac{1}{2}} \\ + M_{0}^{- \frac{1}{2}} (Z Z^{'} - E Z Z^{'}) M_{0}^{- \frac{1}{2}} \\ = G_{0} + \frac{n}{N} I_{p} + E_{2} + E_{3} + E_{4} \end{matrix}

(A52)

Applying Lemma A2, together with (A38) and (A39), we see that:

\begin{matrix} ∥ M_{0}^{- \frac{1}{2}} D D^{'} M_{0}^{- \frac{1}{2}} ∥ \leq C n \end{matrix}

Furthermore, it follows from Lemma A1 that:

∥ M^{- 1 / 2} M_{0}^{1 / 2} - I_{p} ∥ \leq C \sqrt{\frac{p log (n)}{N n}}, and ∥ M^{- 1 / 2} M_{0}^{1 / 2} ∥ = 1 + o (1) .

Combining the estimates above, we conclude that:

\begin{matrix} ∥ E_{1} ∥ \leq C \sqrt{\frac{p n log (n)}{N}} \end{matrix}

We therefore finish the proof of Lemma A4. □

Appendix D.3. Proof of Lemma A5

We begin with the proof of (A29). Recall the definitions:

E_{2} = M_{0}^{- \frac{1}{2}} Z D_{0}^{'} M_{0}^{- \frac{1}{2}}, E_{3} = M_{0}^{- \frac{1}{2}} D_{0} Z^{'} M_{0}^{- \frac{1}{2}} .

We bound:

\begin{matrix} ∥ e_{j}^{'} E_{2} \hat{Ξ} ∥ / n & \leq ∥ e_{j}^{'} E_{2} ∥ / n \leq \frac{1}{n} \sum_{k = 1}^{K} ∥ e_{j}^{'} M_{0}^{- 1 / 2} Z W_{k} ∥ \cdot ∥ A_{k}^{'} M_{0}^{- \frac{1}{2}} ∥ \leq \frac{C}{n} \sum_{k = 1}^{K} ∥ e_{j}^{'} M_{0}^{- 1 / 2} Z W_{k} ∥ \end{matrix}

by the second inequality in (A35). Similarly to how we derived (A37), using Bernstein inequality, we have:

\begin{matrix} ∥ e_{j}^{'} M_{0}^{- 1 / 2} Z W_{k} ∥ & \leq C \frac{\sum_{i = 1}^{n} z_{i} (j) W_{k} (i)}{\sqrt{h_{j}}} = C \sum_{i = 1}^{n} \sum_{m = 1}^{N_{i}} N_{i}^{- 1} h_{j}^{- 1 / 2} (T_{i m} (j) - d_{i}^{0} (j)) W_{k} (i) \\ \leq C \sqrt{\frac{∥ W_{k} ∥^{2} log (n)}{N}} + \frac{C log (n)}{N \sqrt{h_{j}}} \\ \leq C \sqrt{\frac{n log (n)}{N}} + \frac{C log (n)}{N \sqrt{h_{j}}} \end{matrix}

with probability

1 - o (n^{- C_{0} - 3})

. Consequently:

\begin{matrix} ∥ e_{j}^{'} E_{2} \hat{Ξ} ∥ / n \leq C \sqrt{\frac{log (n)}{N n}} + C \frac{log (n)}{n N \sqrt{h_{j}}} \leq C \sqrt{\frac{log (n)}{N n}} \leq C \sqrt{\frac{h_{j} p log (n)}{N n}} \end{matrix}

(A53)

in view of

p log {(n)}^{2} \leq N n

and

h_{j} \geq h_{min} \geq c / p

from Assumption 3.

Analogously, for

Ξ_{3}

, we have:

\begin{matrix} ∥ e_{j}^{'} E_{3} \hat{Ξ} ∥ / n \leq \frac{1}{n} \sum_{k = 1}^{K} ∥ e_{j}^{'} M_{0}^{- 1 / 2} A_{k} ∥ \cdot ∥ W_{k}^{'} Z^{'} M_{0}^{- 1 / 2} \hat{Ξ} ∥ \leq C \sqrt{\frac{h_{j} p log (n)}{N n}} . \end{matrix}

(A54)

where we used

∥ W_{k}^{'} Z^{'} M_{0}^{- 1 / 2} \hat{Ξ} ∥ \leq ∥ M_{0}^{- 1 / 2} Z W_{k} ∥ \leq \sqrt{p n log (n)} / \sqrt{N}

from (A37) and

∥ e_{j}^{'} M_{0}^{- 1 / 2} A_{k} ∥ \leq C \sqrt{h_{j}}

. Hence, we complete the proof of (A29).

In the sequel, we focus on the proof of (A30). Recall that

E_{4} = M_{0}^{- \frac{1}{2}} (Z Z^{'} - E Z Z^{'}) M_{0}^{- \frac{1}{2}}

. We expect to show that:

∥ e_{j}^{'} E_{4} \hat{Ξ} ∥ / n \leq C \sqrt{\frac{h_{j} p log (n)}{N n}} (1 + ∥ H_{0}^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty}) .

Let us decompose

∥ e_{j}^{'} E_{4} \hat{Ξ} ∥ / n

as follows:

\begin{matrix} n^{- 1} ∥ e_{j}^{'} E_{4} \hat{Ξ} ∥ \leq n^{- 1} ∥ e_{j}^{'} E_{4} Ξ ∥ + n^{- 1} ∥ e_{j}^{'} E_{4} (\hat{Ξ} - Ξ O^{'}) ∥ . \end{matrix}

We bound

n^{- 1} ∥ e_{j}^{'} E_{4} Ξ ∥

first. For any fixed

1 \leq k \leq K

, in light of the fact that

M_{0} (j, j) ≍ h_{j}

for all

1 \leq j \leq p

:

\begin{matrix} | e_{j}^{'} E_{4} ξ_{k} | ≍ | e_{j}^{'} H^{- 1 / 2} (Z Z^{'} - E Z Z^{'}) H^{- 1 / 2} ξ_{k} | = | \sum_{i = 1}^{n} h_{j}^{- 1 / 2} z_{i} (j) z_{i}^{'} H^{- 1 / 2} ξ_{k} - h_{j}^{- 1 / 2} E z_{i} (j) z_{i}^{'} H^{- 1 / 2} ξ_{k} | \\ = |\sum_{i = 1}^{n} \frac{1}{N_{i}^{2}} \sum_{m, m_{1} = 1}^{N_{i}} \frac{T_{i m} (j) - d_{i}^{0} (j)}{\sqrt{h_{j}}} \cdot {(T_{i m_{1}} - d_{i}^{0})}^{'} H^{- \frac{1}{2}} ξ_{k} - E [\frac{T_{i m} (j) - d_{i}^{0} (j)}{\sqrt{h_{j}}} \cdot {(T_{i m_{1}} - d_{i}^{0})}^{'} H^{- \frac{1}{2}} ξ_{k}]| \\ \leq | J_{1} | + | J_{2} |, \end{matrix}

with:

\begin{matrix} J_{1} : = \sum_{i = 1}^{n} \frac{1}{N_{i}^{2}} \sum_{m}^{N_{i}} {(T_{i m} - d_{i}^{0})}^{'} H^{- 1 / 2} e_{j} \cdot {(T_{i m} - d_{i}^{0})}^{'} H^{- 1 / 2} ξ_{k} \\ - E {(T_{i m} - d_{i}^{0})}^{'} H^{- 1 / 2} e_{j} \cdot {(T_{i m} - d_{i}^{0})}^{'} H^{- 1 / 2} ξ_{k}, \\ J_{2} : = \sum_{i = 1}^{n} \frac{1}{N_{i}^{2}} \sum_{m \neq m_{1}}^{N_{i}} {(T_{i m} - d_{i}^{0})}^{'} H^{- 1 / 2} e_{j} \cdot {(T_{i m_{1}} - d_{i}^{0})}^{'} H^{- 1 / 2} ξ_{k} . \end{matrix}

For

J_{1}

, it is easy to compute the order of its variance as follows:

\begin{matrix} var (J_{1}) \\ = \sum_{i = 1}^{n} \sum_{m = 1}^{N_{i}} N_{i}^{- 4} var ({(T_{i m} - d_{i}^{0})}^{'} H^{- 1 / 2} e_{j} \cdot {(T_{i m} - d_{i}^{0})}^{'} H^{- 1 / 2} ξ_{k}) \\ = \sum_{i = 1}^{n} \sum_{m = 1}^{N_{i}} N_{i}^{- 4} d_{i}^{0} (j) \cdot \frac{{(1 - d_{i}^{0} (j))}^{2}}{h_{j}} {(\frac{ξ_{k} (j)}{\sqrt{h_{j}}} - \sum_{t} \frac{d_{i}^{0} (t) ξ_{k} (t)}{\sqrt{h_{t}}})}^{2} \\ + \sum_{i = 1}^{n} \sum_{m = 1}^{N_{i}} N_{i}^{- 4} \sum_{t \neq j} d_{i}^{0} (t) \cdot \frac{{(d_{i}^{0} (j))}^{2}}{h_{j}} {(\frac{ξ_{k} (t)}{\sqrt{h_{t}}} - \sum_{s} \frac{d_{i}^{0} (s) ξ_{k} (s)}{\sqrt{h_{s}}})}^{2} \\ - \sum_{i = 1}^{n} \sum_{m = 1}^{N_{i}} \frac{1}{N_{i}^{4}} {(\frac{d_{i}^{0} (j)}{\sqrt{h_{j}}} (\frac{ξ_{k} (j)}{\sqrt{h_{j}}} - \sum_{t} \frac{d_{i}^{0} (t) ξ_{k} (t)}{\sqrt{h_{t}}}) - \sum_{j = 1}^{p} \frac{{(d_{i}^{0} (j))}^{2}}{\sqrt{h_{j}}} (\frac{ξ_{k} (j)}{\sqrt{h_{j}}} - \sum_{t} \frac{d_{i}^{0} (t) ξ_{k} (t)}{\sqrt{h_{t}}}))}^{2} \\ \leq C \frac{n}{N^{3}}, \end{matrix}

where we used the facts that

ξ_{k} (t) \leq \sqrt{h_{t}}

,

d_{i}^{0} (j) \leq C h_{j}

, and

\sum_{t} d_{i}^{0} (t) = 1

. Furthermore, with the trivial bound of each summand in

J_{1}

given by

C N^{- 2} h_{j}^{- 1 / 2}

, it follows from the Bernstein inequality that:

\begin{matrix} | J_{1} | \leq C \sqrt{\frac{n log (n)}{N^{3}}} + C \frac{log (n)}{N^{2} \sqrt{h_{j}}} \leq C \sqrt{\frac{n log (n)}{N^{3}}} \end{matrix}

with probability

1 - o (n^{- 3 - C_{0}})

. Here, we used the conditions that

h_{j} \geq C / p

and

p log {(n)}^{2} \leq N n

.

We proceed to estimate

| J_{2} |

. Employing Theorem A3 with:

h (T_{i m}, T_{i m_{1}}) = N_{i}^{- 2} {(T_{i m} - d_{i}^{0})}^{'} H^{- 1 / 2} e_{j} \cdot {(T_{i m_{1}} - d_{i}^{0})}^{'} H^{- 1 / 2} ξ_{k},

it suffices to examine the high probability bound of:

{\tilde{J}}_{2} : = \sum_{i = 1}^{n} \frac{1}{N_{i}^{2}} \sum_{m \neq m_{1}}^{N_{i}} {(T_{i m} - d_{i}^{0})}^{'} H^{- 1 / 2} e_{j} \cdot {({\tilde{T}}_{i m_{1}} - d_{i}^{0})}^{'} H^{- 1 / 2} ξ_{k}

where

{{\tilde{T}}_{i m_{1}}}

is an independent copy of

{T_{i m_{1}}}

. Imitating the proof of (A45), we rewrite:

{\tilde{J}}_{2} = \sum_{i = 1}^{n} \sum_{m = 1}^{N_{i}} N_{i}^{- 1} {(T_{i m} - d_{i}^{0})}^{'} H^{- 1 / 2} e_{j} \cdot b_{i m} where b_{i m} = (\sum_{m_{1} \neq m} N_{i}^{- 1} {({\tilde{T}}_{i m_{1}} - d_{i}^{0})}^{'} H^{- 1 / 2} ξ_{k})

Notice that

b_{i m}

can be crudely bounded by C in view of

ξ_{k} (t) \leq \sqrt{h_{t}}

. Then, condition on

{{\tilde{T}}_{i m_{1}}}

, by Bernstein inequality, we can derive that:

\begin{matrix} | {\tilde{J}}_{2} | \leq C (\sqrt{\frac{n log (n)}{N}} + \frac{log (n)}{N \sqrt{h_{j}}}) \leq C \sqrt{\frac{n log (n)}{N}} \end{matrix}

with probability

1 - o (n^{- 3 - C_{0}})

. Consequently, we arrive at:

\begin{matrix} | e_{j}^{'} E_{4} ξ_{k} | \leq C \sqrt{\frac{n log (n)}{N}} \leq C \sqrt{\frac{h_{j} p n log (n)}{N}} \end{matrix}

under the assumption that

h_{j} \geq C / p

. As K is a fixed constant, we further conclude:

\begin{matrix} ∥ e_{j}^{'} E_{4} Ξ ∥ \leq C \sqrt{\frac{h_{j} p n log (n)}{N}} \end{matrix}

(A55)

with probability

1 - o (n^{- 3 - C_{0}})

.

Next, we estimate

n^{- 1} ∥ e_{j}^{'} E_{4} (\hat{Ξ} - Ξ O^{'}) ∥

. By definition, we write:

\begin{matrix} \frac{1}{n} ∥ e_{j}^{'} E_{4} (\hat{Ξ} - Ξ O^{'}) ∥ = \frac{1}{n} ∥ e_{j}^{'} M_{0}^{- 1 / 2} (Z Z^{'} - E Z Z^{'}) M_{0}^{- 1 / 2} (\hat{Ξ} - Ξ O^{'}) ∥ . \end{matrix}

For each

1 \leq t \leq p

:

\begin{matrix} \frac{1}{n} | e_{j}^{'} M_{0}^{- 1 / 2} (Z Z^{'} - E Z Z^{'}) e_{t} | \\ ≍ \frac{1}{n \sqrt{h_{j}}} \sum_{i = 1}^{n} z_{i} (j) z_{i} (t) - E (z_{i} (j) z_{j} (t)) \\ = \frac{1}{n \sqrt{h_{j}}} \sum_{i} \sum_{m, \tilde{m}} N_{i}^{- 2} (T_{i m} (j) - d_{i}^{0} (j)) (T_{i \tilde{m}} (t) - d_{i}^{0} (t)) - E (T_{i m} (j) - d_{i}^{0} (j)) (T_{i \tilde{m}} (t) - d_{i}^{0} (t)) \\ = \frac{1}{n \sqrt{h_{j}}} \sum_{i, m} N_{i}^{- 2} (T_{i m} (j) - d_{i}^{0} (j)) (T_{i m} (t) - d_{i}^{0} (t)) - E (T_{i m} (j) - d_{i}^{0} (j)) (T_{i m} (t) - d_{i}^{0} (t)) \\ + \frac{1}{n \sqrt{h_{j}}} \sum_{i} N_{i}^{- 2} \sum_{m \neq \tilde{m}} (T_{i m} (j) - d_{i}^{0} (j)) (T_{i \tilde{m}} (t) - d_{i}^{0} (t)) \\ : = {(I)}_{t} + {(I I)}_{t} . \end{matrix}

For

{(I)}_{k}

, using Bernstein inequality, it yields that with probability

1 - o (n^{- 3 - 2 C_{0}})

:

\begin{matrix} | {(I)}_{t} | & \leq C \{\begin{matrix} max \{\sqrt{\frac{(h_{j} + h_{t}) h_{t} log (n)}{n N^{3}}}, \frac{(h_{j} + h_{t}) log (n)}{n N^{2} \sqrt{h_{j}}}\}, & t \neq j \\ max \{\sqrt{\frac{log (n)}{n N^{3}}}, \frac{log (n)}{n N^{2} \sqrt{h_{j}}}\}, & t = j \end{matrix} \\ \leq C \{\begin{matrix} \sqrt{\frac{(h_{j} + h_{t}) h_{t} log (n)}{n N^{3}}}, & t \neq j \\ \sqrt{\frac{log (n)}{n N^{3}}}, & t = j \end{matrix} \end{matrix}

where the last step is due the the fact

p log {(n)}^{2} \leq N n

from Assumption 3. As a result:

\begin{matrix} \sum_{t = 1}^{p} | {(I)}_{t} | \leq C (\sqrt{p} \sqrt{\frac{\sum_{t \neq j} h_{j} h_{t} log (n)}{n N^{3}}} + \sum_{t \neq j} h_{t} \sqrt{\frac{log (n)}{n N^{3}}} + \sqrt{\frac{log (n)}{n N^{3}}}) \leq C \sqrt{\frac{h_{j} p log (n)}{n N^{3}}} \end{matrix}

(A56)

Here, we used the Cauchy–Schwarz inequality to get:

\sum_{t \neq j} \sqrt{\frac{h_{j} h_{t} log (n)}{n N^{3}}} \leq \sqrt{p - 1} \cdot \sum_{t \neq j} \frac{h_{j} h_{t} log (n)}{n N^{3}} \leq \sqrt{p} \sqrt{\frac{\sum_{t \neq j} h_{j} h_{t} log (n)}{n N^{3}}} .

For

{(I I)}_{t}

, since it is a U-statistics, we then apply the decoupling idea, i.e., Theorem A3, such that its high probability bound can be controlled by that of

{(\tilde{I I})}_{t}

, defined by:

\begin{matrix} {(\tilde{I I})}_{t} : = \frac{1}{n \sqrt{h_{j}}} \sum_{i} N_{i}^{- 2} \sum_{m \neq \tilde{m}} (T_{i m} (j) - d_{i}^{0} (j)) ({\tilde{T}}_{i \tilde{m}} (t) - d_{i}^{0} (t)) . \end{matrix}

where

{{\tilde{T}}_{i \tilde{m}}}_{i, \tilde{m}}

is the i.i.d. copy of

{T_{i m}}_{i, m}

. We further express:

\begin{matrix} {(\tilde{I I})}_{t} & = \frac{1}{n \sqrt{h_{j}}} \sum_{i} N_{i}^{- 2} \sum_{m} (T_{i m} (j) - d_{i}^{0} (j)) {\tilde{T}}_{i, - m}, \end{matrix}

where

{\tilde{T}}_{i, - m} : = \sum_{\tilde{m} \neq m} ({\tilde{T}}_{i \tilde{m}} (t) - d_{i}^{0} (t))

. Condition on

{{\tilde{T}}_{i \tilde{m}}}_{i, \tilde{m}}

, we use Bernstein inequality and get:

\begin{matrix} {(\tilde{I I})}_{t} & \leq C max \{\sqrt{\frac{log (n) \cdot \sum_{i, m} {\tilde{T}}_{i, - m}^{2}}{n^{2} N^{4}}}, \frac{log (n) \cdot {max}_{i, m} {| \tilde{T}}_{i, - m} |}{n N^{2} \sqrt{h_{j}}}\} \\ \leq C \sqrt{\frac{log (n) \cdot {max}_{i, m} {| \tilde{T}}_{i, - m} |^{2}}{n N^{3}}}, \end{matrix}

in light of

p log {(n)}^{2} \leq N n

. Furthermore, notice that:

\begin{matrix} max_{i, m} {| \tilde{T}}_{i, - m} | \leq \sum_{\tilde{m}} | {\tilde{T}}_{i \tilde{m}} (t) - d_{i}^{0} (t) | . \end{matrix}

It follows that:

\begin{matrix} \sum_{t = 1}^{p} | {(\tilde{I I})}_{t} | \leq C \sqrt{\frac{log (n)}{n N}} \cdot \frac{1}{N} \sum_{t = 1}^{p} max_{i, m} {| \tilde{T}}_{i, - m} | & \leq C \sqrt{\frac{log (n)}{n N}} \cdot \frac{1}{N} \sum_{t = 1}^{p} \sum_{\tilde{m}} | {\tilde{T}}_{i \tilde{m}} (t) - d_{i}^{0} (t) | \\ \leq C \sqrt{\frac{log (n)}{n N}}, \end{matrix}

(A57)

where the last step is due to the trivial bound that:

\sum_{t = 1}^{p} | {\tilde{T}}_{i \tilde{m}} (t) - d_{i}^{0} (t) | \leq 1 + \sum_{t = 1}^{p} d_{i}^{0} (t) \leq C

for any

1 \leq \tilde{m} \leq N

. Thus, combining (A56) and (A57), under the condition

h_{j} \geq C / p

, we obtain:

\begin{matrix} \frac{1}{n} ∥ e_{j}^{'} M_{0}^{- 1 / 2} (Z Z^{'} - E Z Z^{'}) ∥_{1} = \frac{1}{n} \sum_{t = 1}^{p} | e_{j}^{'} M_{0}^{- 1 / 2} (Z Z^{'} - E Z Z^{'}) e_{t} | \leq C \sqrt{\frac{h_{j} p log (n)}{n N}} \end{matrix}

(A58)

with probability

1 - o (n^{- 3 - C_{0}})

.

Moreover, employing the estimate

M_{0} (j, j) ≍ h_{j}

for all

1 \leq j \leq p

, it follows that:

\begin{matrix} \frac{1}{n} ∥ e_{j}^{'} E_{4} (\hat{Ξ} - Ξ O^{'}) ∥ & = \frac{1}{n} ∥ e_{j}^{'} M_{0}^{- 1 / 2} (Z Z^{'} - E Z Z^{'}) M_{0}^{- 1 / 2} (\hat{Ξ} - Ξ O^{'}) ∥ \\ \leq \frac{1}{n} ∥ e_{j}^{'} M_{0}^{- 1 / 2} (Z Z^{'} - E Z Z^{'}) ∥_{1} \cdot ∥ M_{0}^{- 1 / 2} H^{1 / 2} ∥ \cdot ∥ H^{- 1 / 2} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty} \\ \leq C \sqrt{\frac{h_{j} p log (n)}{n N}} {∥ H^{- 1 / 2} (\hat{Ξ} - Ξ O^{'}) ∥}_{2 \to \infty} \end{matrix}

(A59)

with probability

1 - o (n^{- 3 - C_{0}})

.

In the end, we combine (A55) and (A59) and consider all j simultaneously to conclude that:

\begin{matrix} n^{- 1} ∥ e_{j}^{'} E_{4} \hat{Ξ} ∥ & \leq n^{- 1} ∥ e_{j}^{'} E_{4} Ξ ∥ + n^{- 1} ∥ e_{j}^{'} E_{4} (\hat{Ξ} - Ξ O^{'}) ∥ \\ \leq C \sqrt{\frac{h_{j} p log (n)}{n N}} (1 + ∥ H^{- 1 / 2} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty}) \end{matrix}

with probability

1 - o (n^{- 3 - C_{0}})

. Combining all

1 \leq j \leq p

, together with

p \leq n^{C_{0}}

, we complete the proof. □

Appendix D.4. Proof of Lemma A6

We first prove (A31) that:

∥ e_{j}^{'} E_{4} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / n \leq C \sqrt{h_{j}} \cdot \frac{p log (n)}{n N} (1 + ∥ H^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty})

By the definition that

E_{4} = M_{0}^{- 1 / 2} (Z Z^{'} - E Z Z^{'}) M_{0}^{- 1 / 2}

, we bound:

\begin{matrix} ∥ e_{j}^{'} E_{4} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / n \leq \frac{1}{n} ∥ e_{j}^{'} M_{0}^{- 1 / 2} (Z Z^{'} - E Z Z^{'}) ∥_{1} \cdot {∥ M_{0}^{- 1 / 2} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥}_{2 \to \infty} . \end{matrix}

From (A58), it holds that

∥ e_{j}^{'} M_{0}^{- 1 / 2} (Z Z^{'} - E Z Z^{'}) ∥_{1} / n \leq C \sqrt{h_{j} p log (n)} / \sqrt{n N}

with probability

1 - o (n^{- 3 - C_{0}})

. Next, we bound:

\begin{matrix} ∥ M_{0}^{- 1 / 2} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥_{2 \to \infty} & \leq ∥ H^{- 1 / 2} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) {Ξ ∥}_{2 \to \infty} \\ + ∥ H^{- 1 / 2} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty} \end{matrix}

The first term on the RHS can be bounded simply by:

\begin{matrix} ∥ H^{- 1 / 2} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) {Ξ ∥}_{2 \to \infty} & \leq C max_{i} | h_{i}^{- 1 / 2} \sqrt{p log (n) / n N} \cdot \sqrt{h_{i}} | \\ \leq C \sqrt{p log (n) / n N} = o (1) \end{matrix}

The second term can be simplified to:

\begin{matrix} ∥ H^{- 1 / 2} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty} & = ∥ (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) H^{- 1 / 2} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty} \\ \leq C \sqrt{\frac{p log (n)}{n N}} \cdot {∥ H^{- 1 / 2} (\hat{Ξ} - Ξ O^{'}) ∥}_{2 \to \infty} . \end{matrix}

As a result:

\begin{matrix} ∥ e_{j}^{'} E_{4} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / n & \leq C \sqrt{\frac{h_{j} p log (n)}{n N}} \cdot \sqrt{\frac{p log (n)}{n N}} (1 + ∥ H_{0}^{- \frac{1}{2}} (Ξ - Ξ_{0} O^{'}) ∥_{2 \to \infty}) \\ \leq C \sqrt{h_{j}} \cdot \frac{p log (n)}{n N} (1 + ∥ H^{- \frac{1}{2}} (\hat{Ξ} - Ξ O^{'}) ∥_{2 \to \infty}) . \end{matrix}

(A60)

This proves (A31).

Subsequently, we prove (A32) that:

∥ e_{j}^{'} (M^{1 / 2} M_{0}^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ \leq C \sqrt{\frac{log (n)}{N n}} + o (β_{n}) \cdot ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ .

We first bound:

\begin{matrix} ∥ e_{j}^{'} (M^{1 / 2} M_{0}^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ & \leq ∥ e_{j}^{'} (M^{1 / 2} M_{0}^{- 1 / 2} - I_{p}) Ξ ∥ + ∥ e_{j}^{'} (M^{1 / 2} M_{0}^{- 1 / 2} - I_{p}) (\hat{Ξ} - Ξ O^{'}) ∥ . \end{matrix}

By Lemma A1,

| M (j, j) - M_{0} (j, j) | / M_{0} (j, j) \leq C \sqrt{log (n)} / \sqrt{N n h_{j}}

. It follows that:

\begin{matrix} ∥ e_{j}^{'} (M^{1 / 2} M_{0}^{- 1 / 2} - I_{p}) Ξ ∥ & \leq |\sqrt{\frac{M (j, j)}{M_{0} (j, j)}} - 1| \cdot ∥ e_{j}^{'} Ξ ∥ \\ \leq C \frac{| M (j, j) - M_{0} (j, j) |}{M_{0} (j, j)} \cdot ∥ e_{j}^{'} Ξ ∥ \\ \leq C \sqrt{\frac{log (n)}{N n}}, \end{matrix}

and:

\begin{matrix} ∥ e_{j}^{'} (M^{1 / 2} M_{0}^{- 1 / 2} - I_{p}) (\hat{Ξ} - Ξ O^{'}) ∥ & \leq |\sqrt{\frac{M (j, j)}{M_{0} (j, j)}} - 1| \cdot ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \\ \leq \sqrt{\frac{p log (n)}{N n}} \cdot ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ \\ = o (β_{n}) \cdot ∥ e_{j}^{'} (\hat{Ξ} - Ξ O^{'}) ∥ . \end{matrix}

by the condition that

p log (n) ≪ N n

. We therefore conclude (A32), simultaneously for all

1 \leq j \leq p

, with probability

1 - o (n^{- 3})

.

Lastly, we prove (A33). By the definition:

E_{1} = M^{- \frac{1}{2}} D D^{'} M^{- \frac{1}{2}} - M_{0}^{- \frac{1}{2}} D D^{'} M_{0}^{- \frac{1}{2}},

and the decomposition:

\begin{matrix} M_{0}^{- \frac{1}{2}} D D^{'} M_{0}^{- \frac{1}{2}} & = G_{0} + \frac{n}{N} I_{p} + E_{2} + E_{3} + E_{4}, where G_{0} = M_{0}^{- 1 / 2} \sum_{i = 1}^{n} (1 - N_{i}^{- 1}) d_{i}^{0} {(d_{i}^{0})}^{'} M_{0}^{- 1 / 2}, \end{matrix}

we bound:

\begin{matrix} ∥ e_{j} E_{1} \hat{Ξ} ∥ / n \\ \leq ∥ e_{j}^{'} (I_{p} - M_{0}^{- 1 / 2} M^{1 / 2}) M^{- 1 / 2} D D^{'} M^{- 1 / 2} \hat{Ξ} ∥ / n + ∥ e_{j}^{'} M_{0}^{- 1 / 2} D D^{'} M_{0}^{- 1 / 2} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / n \\ \leq C ∥ e_{j}^{'} (I_{p} - M_{0}^{- 1 / 2} M^{1 / 2}) \hat{Ξ} ∥ + C ∥ e_{j}^{'} G_{0} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / n \\ + ∥ e_{j}^{'} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / N + \sum_{i = 2}^{4} ∥ e_{j}^{'} E_{i} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / n, \end{matrix}

where we used the fact that

M^{- 1 / 2} D D^{'} M^{- 1 / 2} \hat{Ξ} = \tilde{Λ} \hat{Ξ}

, where

\tilde{Λ} = \hat{Λ} + n N^{- 1} I_{p}

, which leads to

∥ \tilde{Λ} ∥ \leq C n

.

In the same manner to prove

∥ e_{j}^{'} E_{2} \hat{Ξ} ∥ / n

and

∥ e_{j}^{'} E_{3} \hat{Ξ} ∥ / n

, we can bound:

\begin{matrix} \frac{1}{n} ∥ e_{j}^{'} E_{s} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ \leq \frac{1}{n} ∥ e_{j}^{'} E_{s} ∥ ∥ M_{0}^{1 / 2} M^{- 1 / 2} - I_{p} ∥ \leq C \sqrt{\frac{h_{j} p log (n)}{N n}}, for s = 2, 3 . \end{matrix}

(A61)

By Lemma A1, we derive:

\begin{matrix} ∥ e_{j}^{'} G_{0} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / n & \leq C \sum_{t = 1}^{p} \frac{1}{\sqrt{h_{j} h_{t}}} | a_{j}^{'} Σ_{W} a_{t} | \sqrt{\frac{log (n)}{h_{t} N n}} ∥ e_{t}^{'} \hat{Ξ} ∥ \\ \leq C \sqrt{\frac{h_{j} p log (n)}{N n}}, \end{matrix}

(A62)

where we crudely bound

| a_{j}^{'} Σ_{W} a_{t} | \leq h_{j} h_{t}

, and use Cauchy–Schwarz inequality that

\sum_{t = 1}^{p} ∥ e_{t}^{'} \hat{Ξ} ∥ \leq \sqrt{p} \sqrt{tr (\hat{Ξ} {\hat{Ξ}}^{'})} \leq K \sqrt{p}

. In addition:

\begin{matrix} ∥ e_{j}^{'} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / N & \leq |\sqrt{M_{0} (j, j)} / \sqrt{M (j, j)}| \cdot ∥ e_{j}^{'} (I_{p} - M_{0}^{- 1 / 2} M^{1 / 2}) \hat{Ξ} ∥ \\ \leq C ∥ e_{j}^{'} (I_{p} - M_{0}^{- 1 / 2} M^{1 / 2}) \hat{Ξ} ∥, \end{matrix}

which results in:

\begin{matrix} ∥ e_{j} E_{1} \hat{Ξ} ∥ / n \leq C ∥ e_{j}^{'} (I_{p} - M_{0}^{- 1 / 2} M^{1 / 2}) \hat{Ξ} ∥ + C ∥ e_{j}^{'} G_{0} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / n \\ + \sum_{i = 2}^{4} ∥ e_{j}^{'} E_{i} (M_{0}^{1 / 2} M^{- 1 / 2} - I_{p}) \hat{Ξ} ∥ / n . \end{matrix}

(A63)

Combining (A61), (A62), (A31), and (A32) into the above inequality, we complete the proof of (A33).□

Appendix E. Proofs of the Rates for Topic Modeling

The proofs in this section are quite similar to those in [4] by employing the bounds in Theorem 1. For readers’ convenience, we provide brief sketches and refer to more details in the supplementary materials of [4]. Notice that

N_{i} ≍ \bar{N} ≍ N

from Assumption 3. Therefore, throughout this section, we always assume

\bar{N} = N

without loss of generality.

Appendix E.1. Proof of Theorem 2

Recall that:

\hat{R} = {({\hat{r}}_{1}, {\hat{r}}_{2}, \dots, {\hat{r}}_{p})}^{'} = {[diag ({\hat{ξ}}_{1})]}^{- 1} ({\hat{ξ}}_{2}, \dots, ξ_{K}) .

Since the first eigenvector of

G_{0}

is with multiplicity one, which can been seen in Lemma A2, and the fact that

∥ G - G_{0} ∥ ≪ n

, it is not hard to obtain that

O^{'} = diag (ω, Ω^{'})

where

ω \in {1, - 1}

and

Ω^{'}

is an orthogonal matrix in

R^{K - 1, K - 1}

. Let us write

{\hat{Ξ}}_{1} : = ({\hat{ξ}}_{2}, \dots, {\hat{ξ}}_{K})

and similarly for

Ξ_{1}

. Without loss of generality, we assume

ω = 1

. Therefore:

\begin{matrix} | ξ_{1} (j) - {\hat{ξ}}_{1} (j) | \leq C \sqrt{\frac{h_{j} p log (n)}{N n β_{n}^{2}}}, ∥ e_{j}^{'} ({\hat{Ξ}}_{1} - Ξ_{1}) Ω^{'} ∥ \leq C \sqrt{\frac{h_{j} p log (n)}{N n β_{n}^{2}}} . \end{matrix}

(A64)

We rewrite:

\begin{matrix} {\hat{r}}_{j}^{'} - r_{j}^{'} Ω^{'} = {\hat{Ξ}}_{1} (j) \cdot \frac{ξ_{1} (j) - {\hat{ξ}}_{1} (j)}{{\hat{ξ}}_{1} (j) ξ_{1} (j)} - \frac{e_{j}^{'} ({\hat{Ξ}}_{1} - Ξ_{1} Ω^{'})}{ξ_{1} (j)} . \end{matrix}

Using Lemma A2 together with (A64), we conclude the proof. □

Appendix E.2. Proof of Theorem 3

In this section, we provide a simplified proof by neglecting the details about some quantities in the oracle case. We refer readers to the proof of Theorem 3.3 of [4] for more rigorous arguments.

Proof of Theorem 3.

Recall the Topic-SCORE algorithm. Let

\hat{V} = ({\hat{v}}_{1}, {\hat{v}}_{2}, \dots, {\hat{v}}_{K})

and denote its population counterpart by V. We write:

\begin{matrix} \hat{Q} = (\begin{matrix} 1 & \dots & 1 \\ {\hat{v}}_{1} & \dots & {\hat{v}}_{K} \end{matrix}), Q = (\begin{matrix} 1 & \dots & 1 \\ v_{1} & \dots & v_{K} \end{matrix}) \end{matrix}

Similarly to [4], by properly choosing the vertex hunting algorithm and the anchor words condition, it can be seen that:

∥ \hat{V} - V ∥ \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{2}}}

where we omit the permutation for simplicity here and throughout this proof. As a result:

\begin{matrix} ∥ {\hat{π}}_{j}^{*} - π_{j}^{*} ∥ & = ∥{\hat{Q}}^{- 1} (\begin{matrix} 1 \\ {\hat{r}}_{j} \end{matrix}) - Q^{- 1} (\begin{matrix} 1 \\ Ω r_{j} \end{matrix})∥ \\ \leq C ∥ Q^{- 1} ∥^{2} \cdot ∥ \hat{V} - V ∥ \cdot ∥ r_{j} ∥ + ∥ Q^{- 1} ∥ ∥ {\hat{r}}_{j} - Ω r_{j} ∥ \\ \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} = o (1) \end{matrix}

where we used the fact that

∥ Q^{- 1} ∥ \leq C

whose details can be found in the proof of Lemma G.1 in supplementary material of [4]. Considering the truncation at 0, it is not hard to see that:

∥ {\tilde{π}}_{j}^{*} - π_{j}^{*} ∥ \leq C ∥ {\hat{π}}_{j}^{*} - π_{j}^{*} ∥ \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} = o (1);

and furthermore:

\begin{matrix} ∥ {\hat{π}}_{j} - π_{j} ∥_{1} & \leq \frac{∥ {\tilde{π}}_{j}^{*} - π_{j}^{*} ∥_{1}}{∥ {\tilde{π}}_{j}^{*} ∥_{1}} + \frac{∥ π_{j}^{*} ∥_{1} | ∥ {\tilde{π}}_{j}^{*} ∥_{1} - {∥ π_{j}^{*} ∥}_{1} |}{∥ {\tilde{π}}_{j}^{*} ∥_{1} {∥ π_{j}^{*} ∥}_{1}} \\ \leq C ∥ {\tilde{π}}_{j}^{*} - π_{j}^{*} ∥_{1} \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} . \end{matrix}

(A65)

by noticing that

π_{j} = π_{j}^{*}

in the oracle case.

Recall that

\tilde{A} = M^{1 / 2} diag ({\hat{ξ}}_{1}) \hat{Π} = : {({\tilde{a}}_{1}, \dots, {\tilde{a}}_{p})}^{'}

. Let

A^{*} = M_{0}^{1 / 2} diag (ξ_{1}) Π = {(a_{1}^{*}, \dots, a_{p}^{*})}^{'}

. Note that

A = A^{*} {[diag (1_{p} A^{*})]}^{- 1}

. We can derive:

\begin{matrix} ∥ {\tilde{a}}_{j} - a_{j}^{*} ∥_{1} & \leq ∥ \sqrt{M (j, j)} {\hat{ξ}}_{1} (j) {\hat{π}}_{j} - \sqrt{M_{0} (j, j)} ξ_{1} (j) π_{j} ∥_{1} \\ \leq C ∥ \sqrt{M (j, j)} - \sqrt{M_{0} (j, j)} ∥ \cdot ∥ ξ_{1} (j) ∥ \cdot ∥ π_{j} ∥_{1} + C \sqrt{M_{0} (j, j)} \cdot ∥ {\hat{ξ}}_{1} (j) - ξ_{1} (j) ∥ \cdot ∥ π_{j} ∥_{1} \\ + C \sqrt{M_{0} (j, j)} \cdot ∥ ξ_{1} (j) ∥ \cdot ∥ {\hat{π}}_{j} - π_{j} ∥_{1} \\ \leq C h_{j} \sqrt{\frac{p log (n)}{N n β_{n}^{2}}}, \end{matrix}

(A66)

where we used (A65), (A64) and also Lemma A1. Write

\tilde{A} = ({\tilde{A}}_{1}, \dots, {\tilde{A}}_{K})

and

A^{*} = (A_{1}^{*}, \dots, A_{K}^{*})

. We crudely bound:

\begin{matrix} | ∥ {\tilde{A}}_{k} ∥_{1} - ∥ A_{k}^{*} ∥_{1} | \leq \sum_{j = 1}^{p} {∥ {\tilde{a}}_{j} - a_{j}^{*} ∥}_{1} \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} = o (1) \end{matrix}

(A67)

simultaneously for all

1 \leq k \leq K

, since

\sum_{j} h_{j} = K

. By the study of oracle case in [4], it can be deduced that

∥ A_{k}^{*} ∥_{1} ≍ 1

(see more details in the supplementary materials of [4]). It then follows that:

\begin{matrix} ∥ {\hat{a}}_{j} - a_{j} ∥_{1} & = ∥ diag (1 / ∥ {\tilde{A}}_{1} ∥_{1}, \dots, 1 / ∥ {\tilde{A}}_{K} ∥_{1}) {\tilde{a}}_{j} - diag (1 / ∥ A_{1}^{*} ∥_{1}, \dots, 1 / ∥ A_{K}^{*} ∥_{1}) a_{j}^{*} ∥_{1} \\ = \sum_{k = 1}^{K} | \frac{{\tilde{a}}_{j} (k)}{∥ {\tilde{A}}_{k} ∥_{1}} - \frac{a_{j}^{*} (k)}{∥ A_{k}^{*} ∥_{1}} | \\ \leq \sum_{k = 1}^{K} | \frac{{\tilde{a}}_{j} (k) - a_{j}^{*} (k)}{∥ A_{k}^{*} ∥_{1}} | + | a_{j}^{*} (k) | \frac{| ∥ {\hat{A}}_{k} ∥_{1} - {∥ A_{k}^{*} ∥}_{1} |}{∥ A_{k}^{*} ∥_{1} {∥ {\tilde{A}}_{k} ∥}_{1}} \\ \leq C \sum_{k = 1}^{K} ∥ {\tilde{a}}_{j} - a_{j}^{*} ∥_{1} + ∥ a_{j}^{*} ∥_{1} max_{k} | ∥ {\tilde{A}}_{k} ∥_{1} - {∥ A_{k}^{*} ∥}_{1} | \\ \leq C h_{j} \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} = C {∥ a_{j} ∥}_{1} \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} . \end{matrix}

Here, we used (A66), (A67) and the following estimate:

∥ a_{j}^{*} ∥_{1} = \sqrt{M_{0} (j, j)} | ξ_{1} (j) | ∥ π_{j}^{*} ∥ ≍ h_{j}

Combining all j together, we immediately have the result for

L (\hat{A}, A)

. □

Appendix E.3. Proof of Theorem 4

The optimization in (12) has a explicit solution given by:

{\hat{w}}_{i}^{*} = {({\hat{A}}^{'} M^{- 1} \hat{A})}^{- 1} {\hat{A}}^{'} M^{- 1} d_{i} .

Notice that

{(A^{'} M_{0}^{- 1} A)}^{- 1} A^{'} M_{0}^{- 1} d_{i}^{0} = {(A^{'} M_{0}^{- 1} A)}^{- 1} A^{'} M_{0}^{- 1} A w_{i} = w_{i}

. Consequently:

\begin{matrix} ∥ {\hat{w}}_{i}^{*} - w_{i} ∥_{1} & = ∥ {({\hat{A}}^{'} M^{- 1} \hat{A})}^{- 1} {\hat{A}}^{'} M^{- 1} d_{i} - {(A^{'} M_{0}^{- 1} A)}^{- 1} A^{'} M_{0}^{- 1} d_{i}^{0} ∥_{1} \\ \leq ∥ {(A^{'} M_{0}^{- 1} A)}^{- 1} ({\hat{A}}^{'} M^{- 1} \hat{A} - A^{'} M_{0}^{- 1} A) {({\hat{A}}^{'} M^{- 1} \hat{A})}^{- 1} {\hat{A}}^{'} M^{- 1} d_{i} ∥_{1} \\ + ∥ {(A^{'} M_{0}^{- 1} A)}^{- 1} ({\hat{A}}^{'} M^{- 1} d_{i} - A^{'} M_{0}^{- 1} d_{i}^{0}) ∥_{1} \\ \leq C β_{n}^{- 1} ∥ ({\hat{A}}^{'} M^{- 1} \hat{A} - A^{'} M_{0}^{- 1} A) ∥ (∥ {\hat{w}}_{i}^{*} - w_{i} ∥_{1} + ∥ w_{i} ∥_{1}) \\ + C β_{n}^{- 1} ∥ {\hat{A}}^{'} M^{- 1} d_{i} - A^{'} M_{0}^{- 1} d_{i}^{0} ∥, \end{matrix}

(A68)

since

∥ {(A^{'} M_{0}^{- 1} A)}^{- 1} ∥ ≍ ∥ {(A^{'} H^{- 1} A)}^{- 1} ∥ ≍ 1

. What remains is to analyze:

T_{1} : = ∥ ({\hat{A}}^{'} M^{- 1} \hat{A} - A^{'} M_{0}^{- 1} A) ∥, and T_{2} : = ∥ {\hat{A}}^{'} M^{- 1} d_{i} - A^{'} M_{0}^{- 1} d_{i}^{0} ∥ .

For

T_{1}

, we bound:

\begin{matrix} T_{1} & \leq ∥ {(\hat{A} - A)}^{'} M^{- 1} \hat{A} ∥ + ∥ A^{'} (M^{- 1} - M_{0}^{- 1}) \hat{A} ∥ \\ + ∥ A^{'} M_{0}^{- 1} (\hat{A} - A) ∥ . \end{matrix}

Using the estimates:

\begin{matrix} ∥ {\hat{a}}_{j} - a_{j} ∥_{1} \leq C h_{j} \sqrt{\frac{p log (n)}{N n β_{n}^{2}}}, | M {(j, j)}^{- 1} - M_{0} {(j, j)}^{- 1} | \leq \frac{\sqrt{log (n)}}{h_{j} \sqrt{N n h_{j}}}, \end{matrix}

it follows that:

\begin{matrix} ∥ A^{'} (M^{- 1} - M_{0}^{- 1}) (\hat{A} - A) ∥ & \leq \sum_{k, k_{1} = 1}^{K} | A_{k} (M^{- 1} - M_{0}^{- 1}) ({\hat{A}}_{k_{1}} - A_{k_{1}}) | \\ ≪ \sum_{k = 1}^{K} ∥ {\hat{A}}_{k} - A_{k} ∥_{1} = \sum_{j = 1}^{p} {∥ {\hat{a}}_{j} - a_{j} ∥}_{1} \\ ≪ \sqrt{\frac{p log (n)}{N n β_{n}^{2}}}, \end{matrix}

and similarly:

\begin{matrix} ∥ {(\hat{A} - A)}^{'} M_{0}^{- 1} (\hat{A} - A) ∥ ≪ \sum_{k = 1}^{K} {∥ {\hat{A}}_{k} - A_{k} ∥}_{1} ≪ \sqrt{\frac{p log (n)}{N n β_{n}^{2}}}, \\ ∥ {(\hat{A} - A)}^{'} (M^{- 1} - M_{0}^{- 1}) (\hat{A} - A) ∥ ≪ \sum_{k = 1}^{K} {∥ {\hat{A}}_{k} - A_{k} ∥}_{1} ≪ \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} . \end{matrix}

As a result:

\begin{matrix} T_{1} & \leq C ∥ {(\hat{A} - A)}^{'} M_{0}^{- 1} A ∥ + C ∥ A^{'} (M^{- 1} - M_{0}^{- 1}) A ∥ \\ \leq C \sum_{j = 1}^{p} ∥ {\hat{a}}_{j} - a_{j} ∥_{1} + C \sqrt{\frac{p log (n)}{N n}} \cdot \sum_{j = 1}^{p} {∥ a_{j} ∥}_{1} \\ \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} . \end{matrix}

(A69)

Next, for

T_{2}

, we bound:

\begin{matrix} T_{2} & \leq ∥ {(\hat{A} - A)}^{'} M^{- 1} d_{i} ∥ + ∥ A^{'} (M^{- 1} - M_{0}^{- 1}) d_{i} ∥ + ∥ A^{'} M_{0}^{- 1} (d_{i} - d_{i}^{0}) ∥ \\ \leq max_{j} (\frac{∥ {\hat{a}}_{j} - a_{j} ∥_{1}}{h_{j}} + {∥ a_{j} ∥}_{1} \frac{\sqrt{log (n)}}{h_{j} \sqrt{N n h_{j}}}) \cdot {∥ d_{i} ∥}_{1} + max_{1 \leq k \leq K} | A_{k}^{'} M_{0}^{- 1} (d_{i} - d_{i}^{0}) | \\ \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} + max_{1 \leq k \leq K} | A_{k}^{'} M_{0}^{- 1} (d_{i} - d_{i}^{0}) | . \end{matrix}

where for

{(\hat{A} - A)}^{'} M^{- 1} d_{i}

, given the low-dimension K, we crudely bound:

∥ {(\hat{A} - A)}^{'} M^{- 1} d_{i} ∥ \leq C max_{k} | {({\hat{A}}_{k} - A_{k})}^{'} M^{- 1} d_{i} | \leq C max_{k, j} | h_{j}^{- 1} ({\hat{a}}_{j} (k) - a_{j} (k)) | {∥ d_{i} ∥}_{1}

and

| {\hat{a}}_{j} (k) - a_{j} (k) | \leq {∥ {\hat{a}}_{j} - a_{j} ∥}_{1}

. We bound

∥ A^{'} (M^{- 1} - M_{0}^{- 1}) d_{i} ∥

in the same manner. To proceed, we analyze

| A_{k}^{'} M_{0}^{- 1} (d_{i} - d_{i}^{0}) |

for a fixed k. We rewrite it as:

\begin{matrix} A_{k}^{'} M_{0}^{- 1} (d_{i} - d_{i}^{0}) = \frac{1}{N_{i}} \sum_{m = 1}^{N_{i}} A_{k}^{'} M_{0}^{- 1} (T_{i m} - T_{i m}) . \end{matrix}

The RHS is an independent sum where Bernstein inequality can be applied. By elementary computations, the variance is:

\begin{matrix} N_{i}^{- 1} var (A_{k}^{'} M_{0}^{- 1} (T_{i m} - T_{i m})) & = N_{i}^{- 1} E {(A_{k}^{'} M_{0}^{- 1} (T_{i m} - T_{i m}))}^{2} \\ = N_{i}^{- 1} A_{k}^{'} M_{0}^{- 1} diag (d_{i}^{0}) M_{0}^{- 1} A_{k} - N_{i}^{- 1} {(A_{k}^{'} M_{0}^{- 1} d_{i}^{0})}^{2} \\ \leq N^{- 1} \end{matrix}

and the individual bound is crudely

N^{- 1}

. It follows from Bernstein inequality that with probability

1 - o (n^{- 4})

:

∥ A_{k}^{'} M_{0}^{- 1} (d_{i} - d_{i}^{0}) ∥ \leq C (\sqrt{\frac{log (n)}{N}} + \frac{log (n)}{N}) \leq C \sqrt{\frac{log (n)}{N}}

in light of

N ≫ log (n)

. This gives rise to:

\begin{matrix} T_{2} \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{2}}} + C \sqrt{\frac{log (n)}{N}} \end{matrix}

We substitute the above equation, together with (A69), into (A68) and conclude that:

∥ {\hat{w}}_{i}^{*} - w_{i} ∥_{1} \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{4}}} + C \sqrt{\frac{log (n)}{N β_{n}^{2}}} .

Recall that the actual estimator

{\hat{w}}_{i}

is defined by:

{\hat{w}}_{i} = max {{\hat{w}}_{i}^{*}, 0} / {∥ max {{\hat{w}}_{i}^{*}, 0} ∥}_{1},

where the maximum is taken entry-wisely. We write

{\tilde{w}}_{i} : = max {{\hat{w}}_{i}^{*}, 0}

for short. Since

w_{i}

is always non-negative, it is not hard to see that:

∥ {\tilde{w}}_{i} - w_{i} ∥_{1} \leq C {∥ {\hat{w}}_{i}^{*} - w_{i} ∥}_{1} \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{4}}} + C \sqrt{\frac{log (n)}{N β_{n}^{2}}} = o (1) .

As a result,

∥ {\tilde{w}}_{i} ∥_{1} = 1 + o (1)

. Moreover:

\begin{matrix} ∥ {\hat{w}}_{i} - w_{i} ∥_{1} & \leq \frac{∥ {\tilde{w}}_{i} - w_{i} ∥_{1}}{∥ {\tilde{w}}_{i} ∥_{1}} + {∥ w_{i} ∥}_{1} | \frac{1}{∥ {\tilde{w}}_{i} ∥_{1}} - \frac{1}{∥ w_{i} ∥_{1}} | \\ \leq C ∥ {\tilde{w}}_{i} - w_{i} ∥_{1} \leq C \sqrt{\frac{p log (n)}{N n β_{n}^{4}}} + C \sqrt{\frac{log (n)}{N β_{n}^{2}}} \end{matrix}

with probability

1 - o (n^{- 4})

. Combining all i, we thus conclude the proof. □

References

Hofmann, T. Probabilistic latent semantic indexing. In Proceedings of the International ACM SIGIR Conference, Berkeley, CA, USA, 15–19 August 1999; pp. 50–57. [Google Scholar]
Blei, D.; Ng, A.; Jordan, M. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
Ke, Z.T.; Ji, P.; Jin, J.; Li, W. Recent advances in text analysis. Annu. Rev. Stat. Its Appl. 2023, 11, 347–372. [Google Scholar] [CrossRef]
Ke, Z.T.; Wang, M. Using SVD for topic modeling. J. Am. Stat. Assoc. 2024, 119, 434–449. [Google Scholar] [CrossRef]
de la Pena, V.H.; Montgomery-Smith, S.J. Decoupling inequalities for the tail probabilities of multivariate U-statistics. Ann. Probab. 1995, 23, 806–816. [Google Scholar] [CrossRef]
Arora, S.; Ge, R.; Moitra, A. Learning topic models–going beyond SVD. In Proceedings of the Foundations of Computer Science (FOCS), New Brunswick, NJ, USA, 20–23 October 2012; pp. 1–10. [Google Scholar]
Arora, S.; Ge, R.; Halpern, Y.; Mimno, D.; Moitra, A.; Sontag, D.; Wu, Y.; Zhu, M. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; pp. 280–288. [Google Scholar]
Bansal, T.; Bhattacharyya, C.; Kannan, R. A provable SVD-based algorithm for learning topics in dominant admixture corpus. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1997–2005. [Google Scholar]
Bing, X.; Bunea, F.; Wegkamp, M. A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics. Bernoulli 2020, 26, 1765–1796. [Google Scholar] [CrossRef]
Erdős, L.; Knowles, A.; Yau, H.T.; Yin, J. Spectral statistics of Erdős–Rényi graphs I: Local semicircle law. Ann. Probab. 2013, 41, 2279–2375. [Google Scholar] [CrossRef]
Fan, J.; Wang, W.; Zhong, Y. An L-infinity eigenvector perturbation bound and its application to robust covariance estimation. J. Mach. Learn. Res. 2018, 18, 1–42. [Google Scholar]
Fan, J.; Fan, Y.; Han, X.; Lv, J. SIMPLE: Statistical inference on membership profiles in large networks. J. R. Stat. Soc. Ser. B. 2022, 84, 630–653. [Google Scholar] [CrossRef]
Abbe, E.; Fan, J.; Wang, K.; Zhong, Y. Entrywise eigenvector analysis of random matrices with low expected rank. Ann. Statist. 2020, 48, 1452–1474. [Google Scholar] [CrossRef] [PubMed]
Chen, Y.; Chi, Y.; Fan, J.; Ma, C. Spectral methods for data science: A statistical perspective. Found. Trends® Mach. Learn. 2021, 14, 566–806. [Google Scholar] [CrossRef]
Ke, Z.T.; Wang, J. Optimal network membership estimation under severe degree heterogeneity. arXiv 2022, arXiv:2204.12087. [Google Scholar]
Paul, D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Stat. Sin. 2007, 17, 1617. [Google Scholar]
Zipf, G.K. The Psycho-Biology of Language: An Introduction to Dynamic Philology; Routledge: London, UK, 2013. [Google Scholar]
Davis, C.; Kahan, W.M. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 1970, 7, 1–46. [Google Scholar] [CrossRef]
Horn, R.; Johnson, C. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985. [Google Scholar]
Jin, J. Fast community detection by SCORE. Ann. Statist. 2015, 43, 57–89. [Google Scholar] [CrossRef]
Ke, Z.T.; Jin, J. Special invited paper: The SCORE normalization, especially for heterogeneous network and text data. Stat 2023, 12, e545. [Google Scholar] [CrossRef]
Donoho, D.; Stodden, V. When does non-negative matrix factorization give a correct decomposition into parts? In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 13–18 December 2004; pp. 1141–1148. [Google Scholar]
Araújo, M.C.U.; Saldanha, T.C.B.; Galvao, R.K.H.; Yoneyama, T.; Chame, H.C.; Visani, V. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemom. Intell. Lab. Syst. 2001, 57, 65–73. [Google Scholar] [CrossRef]
Jin, J.; Ke, Z.T.; Moryoussef, G.; Tang, J.; Wang, J. Improved algorithm and bounds for successive projection. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
Wu, R.; Zhang, L.; Tony Cai, T. Sparse topic modeling: Computational efficiency, near-optimal algorithms, and statistical inference. J. Am. Stat. Assoc. 2023, 118, 1849–1861. [Google Scholar] [CrossRef]
Klopp, O.; Panov, M.; Sigalla, S.; Tsybakov, A.B. Assigning topics to documents by successive projections. Ann. Stat. 2023, 51, 1989–2014. [Google Scholar] [CrossRef]
Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications; Cambridge University Press: Cambridge, UK, 2012; pp. 210–268. [Google Scholar]
Tropp, J. User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 2012, 12, 389–434. [Google Scholar] [CrossRef]
De la Pena, V.; Giné, E. Decoupling: From Dependence to Independence; Springer Science & Business Media: Berlin, Germany, 2012. [Google Scholar]
Freedman, D.A. On tail probabilities for martingales. Ann. Probab. 1975, 3, 100–118. [Google Scholar] [CrossRef]
Bloemendal, A.; Knowles, A.; Yau, H.T.; Yin, J. On the principal components of sample covariance matrices. Probab. Theory Relat. Fields 2016, 164, 459–552. [Google Scholar] [CrossRef]

Figure 1. An illustration of Topic-SCORE in the noiseless case (

K = 3

). The blue dots are

r_{j} \in R^{K - 1}

(word embeddings constructed from the population singular vectors), and they are contained in a simplex with K vertices. This simplex can be recovered from a vertex hunting algorithm. Given this simplex, each

r_{j}

has a unique barycentric coordinate

π_{j} \in R^{K}

. The topic matrix A is recovered from stacking together these

π_{j}

’s and utilizing

M_{0}

and

ξ_{1}

.

Figure 1. An illustration of Topic-SCORE in the noiseless case (

K = 3

). The blue dots are

r_{j} \in R^{K - 1}

(word embeddings constructed from the population singular vectors), and they are contained in a simplex with K vertices. This simplex can be recovered from a vertex hunting algorithm. Given this simplex, each

r_{j}

has a unique barycentric coordinate

π_{j} \in R^{K}

. The topic matrix A is recovered from stacking together these

π_{j}

’s and utilizing

M_{0}

and

ξ_{1}

.

Table 1. A summary of the existing theoretical results for estimating A (n is the number of documents, p is the vocabulary size, N is the order of document lengths, and

h_{max}

and

h_{min}

are the same as in (8)). Cases 1–3 refer to

N \geq p^{4 / 3}

,

p \leq N < p^{4 / 3}

, and

N < p

, respectively. For Cases 2–3, the sub-cases ‘a’ and ‘b’ correspond to

n \geq max {N p^{2}, p^{3}, N^{2} p^{5}}

and

n < max {N p^{2}, p^{3}, N^{2} p^{5}}

, respectively. We have translated the results in each paper to the bounds on

L (\hat{A}, A)

, with any logarithmic factor omitted.

Table 1. A summary of the existing theoretical results for estimating A (n is the number of documents, p is the vocabulary size, N is the order of document lengths, and

h_{max}

and

h_{min}

are the same as in (8)). Cases 1–3 refer to

N \geq p^{4 / 3}

,

p \leq N < p^{4 / 3}

, and

N < p

, respectively. For Cases 2–3, the sub-cases ‘a’ and ‘b’ correspond to

n \geq max {N p^{2}, p^{3}, N^{2} p^{5}}

and

n < max {N p^{2}, p^{3}, N^{2} p^{5}}

, respectively. We have translated the results in each paper to the bounds on

L (\hat{A}, A)

, with any logarithmic factor omitted.

	Case 1	Case 2a	Case 2b	Case 3a	Case 3b
Ke & Wang [4]	$\sqrt{\frac{p}{N n}}$	$\sqrt{\frac{p}{N n}}$	$\frac{p^{2}}{N \sqrt{N}} \sqrt{\frac{p}{N n}}$	$\frac{p}{N} \sqrt{\frac{p}{N n}}$	$\frac{p^{2}}{N \sqrt{N}} \sqrt{\frac{p}{N n}}$
Arora et al. [6]	$\frac{p^{4}}{\sqrt{N n}}$	$\frac{p^{4}}{\sqrt{N n}}$	$\frac{p^{4}}{\sqrt{N n}}$	$\frac{p^{4}}{\sqrt{N n}}$	$\frac{p^{4}}{\sqrt{N n}}$
Bing et al. [9]	$\sqrt{\frac{p}{N n}} \cdot \frac{h_{max}}{h_{min}}$	$\sqrt{\frac{p}{N n}} \cdot \frac{h_{max}}{h_{min}}$	$\sqrt{\frac{p}{N n}} \cdot \frac{h_{max}}{h_{min}}$	NA	NA
Bansal et al. [8]	$N \sqrt{\frac{p}{n}}$	$N \sqrt{\frac{p}{n}}$	$N \sqrt{\frac{p}{n}}$	$N \sqrt{\frac{p}{n}}$	$N \sqrt{\frac{p}{n}}$
Our results	$\sqrt{\frac{p}{N n}}$	$\sqrt{\frac{p}{N n}}$	$\sqrt{\frac{p}{N n}}$	$\sqrt{\frac{p}{N n}}$	$\sqrt{\frac{p}{N n}}$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ke, Z.T.; Wang, J. Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents. Mathematics 2024, 12, 1682. https://doi.org/10.3390/math12111682

AMA Style

Ke ZT, Wang J. Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents. Mathematics. 2024; 12(11):1682. https://doi.org/10.3390/math12111682

Chicago/Turabian Style

Ke, Zheng Tracy, and Jingming Wang. 2024. "Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents" Mathematics 12, no. 11: 1682. https://doi.org/10.3390/math12111682

APA Style

Ke, Z. T., & Wang, J. (2024). Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents. Mathematics, 12(11), 1682. https://doi.org/10.3390/math12111682

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents

Abstract

1. Introduction

1.1. Related Literature

1.2. Organization and Notations

2. Entry-Wise Eigenvector Analysis for Topic Models

2.1. A Normalized Data Matrix

2.2. Entry-Wise Singular Analysis for M − 1 / 2 D

3. Improved Rates for Topic Modeling

3.1. The Topic-Score Algorithm

3.2. The Improved Rates for Estimating A and W

3.3. Connections and Comparisons

4. Proof Ideas

4.1. Why the Leave-One-Out Technique Fails

4.2. The Proof Structure in [4] and Why It Is Not Sharp for Short Documents

4.3. Non-Stochastic Perturbation Analysis

4.4. Large-Deviation Analysis of ( G − G 0 )

4.5. Proof Sketch of Theorem 1

5. Summary and Discussion

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Preliminary Lemmas and Theorems

Appendix B. Proofs of Lemmas 1 and 2

Appendix B.1. Proof of Lemma 1

Appendix B.2. Proof of Lemma 2

Appendix C. The Complete Proof of Theorem 1

Appendix D. Entry-Wise Eigenvector Analysis and Proof of Lemma A3

Appendix D.1. Proof of Lemma A3

Appendix D.2. Proof of Lemma A4

Appendix D.3. Proof of Lemma A5

Appendix D.4. Proof of Lemma A6

Appendix E. Proofs of the Rates for Topic Modeling

Appendix E.1. Proof of Theorem 2

Appendix E.2. Proof of Theorem 3

Appendix E.3. Proof of Theorem 4

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

2.2. Entry-Wise Singular Analysis for $M^{- 1 / 2} D$

4.4. Large-Deviation Analysis of $(G - G_{0})$