
Predicting Question Popularity for Community Question Answering

1 School of Law, Guangdong University of Technology, Guangzhou 510520, China
2 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3260; https://doi.org/10.3390/electronics13163260
Submission received: 2 June 2024 / Revised: 5 July 2024 / Accepted: 22 July 2024 / Published: 16 August 2024
(This article belongs to the Special Issue Natural Language Processing and Information Retrieval, 2nd Edition)

Abstract

In this paper, we study the problem of predicting the popularity of questions in Community Question Answering (CQA). To address this problem, we propose a Posterior Attention Recurrent Point Process Model (PARPP) that takes both user interactions and the Matthew effect into account for question popularity prediction. Our PARPP uses long short-term memory (LSTM) to encode the observed history and another LSTM network to record each step of decoding information. At each decoding step, it uses prior attention to capture the answers that have a greater impact on the question. When a new answer is observed, it uses Bayes' rule to modify the prior attention and obtain the posterior attention, which is then used to update the decoding state. We further introduce a convergence strategy to capture the Matthew effect in CQA. We conduct experiments on a dataset crawled from Zhihu, a famous Chinese CQA forum. The experimental results show that our model outperforms several state-of-the-art methods. We further analyze the attention mechanism in our model. Our analysis shows that the proposed attention mechanism can better capture the impact of each answer on the future popularity of the question, which makes our model more interpretable. Our study may also shed light on related problems such as ranking the answers to a question and finding experts on the topics of questions.

1. Introduction

In recent years, Community Question Answering (CQA) forums such as Zhihu (http://zhihu.com/), Quora (https://www.quora.com/), and Yahoo! Answers (https://answers.yahoo.com/) have been booming and have become important online platforms for users to exchange ideas and find information. CQA provides a convenient platform where people can ask their own questions and share their answers. Moreover, as in most social networks, users in CQA can interact with each other; e.g., users can vote and write comments for their favorite answers. In CQA, the accumulated interactions on a question affect the ordering of its answers. In recent years, research on CQA has focused on question retrieval and answer ranking to help users find relevant questions and high-quality answers better and faster. Predicting question popularity is of great importance because it allows communities to better organize their questions, both to promote community development and to improve the user experience.
Previous work [1] on CQA addresses a wide spectrum of applications such as question retrieval [2], answer selection [3], heterogeneous question answering community detection [4], and the automatic moderation of user-generated content [5]. Few studies have focused on predicting question popularity. Yet, predicting question popularity, such as when the next answer will appear or how many answers will appear in the future, is crucial to the development of the community. Accurate prediction would allow the community to rank content better, discover trends in questions, further improve its content-delivery networks, and push ads more effectively. Existing question popularity prediction methods simply perform a qualitative analysis of a question's popularity [6,7]; i.e., they only determine whether a given question will be popular but lack a more fine-grained quantitative analysis.
Given a question and the trace of its past answers, when will the next answer appear? How many answers will users provide to the question? A number of temporal stochastic process models [8] have been proposed and proven able to effectively capture the arrival of information in online media. Although these models have made great progress in predicting the popularity of online cascades, they suffer from the following three drawbacks:
  • None of these temporal point process models take into account the impact of user interaction on an event cascade in an online community. An important feature of online communities is that people can interact with others anytime, anywhere. For instance, in CQA, users can vote for their favorite answers or leave their own opinions in the comments section. These interactions affect the ranking of the answers. Figure 1 shows the answers to a question related to interpersonal socialization on Zhihu, a famous Chinese Community Question Answering platform. As we can see, the top-ranked answers, though they were posted a long time ago (four years earlier), received more than 10k upvotes, so they are ranked ahead. In contrast, new answers received a small number of votes, and thus they are ranked at the bottom and are not visible at the top. It is generally believed that recent events have a greater impact on the future than past events [8,9,10], but due to the special interaction mechanism in CQA, an old but high-quality answer keeps gaining users' attention because of its higher ranking, which has a lasting and important impact on the future of a question.
  • There is an obvious Matthew effect [11] in CQA, which has not been considered in previous models. The Matthew effect is a concept originating in the biblical Gospel of Matthew. It is sometimes summarized by the adage "the rich get richer, while the poor get poorer". Robert Merton, a sociologist, first observed the Matthew effect in academia: eminent scientists often receive more credit than comparatively unknown researchers even if their work is similar. The Matthew effect has been found in many aspects of our social life [12,13,14,15,16,17,18].
    In CQA, this effect is very common and is amplified by answer visibility. The rapid development of online social media has brought about an explosion of information. However, people are usually attracted by only a few top-ranked high-quality answers, while the rest go unnoticed [19]. Moreover, these excellent answers accumulate advantages: a higher ranking makes it easier for them to attract users' attention and obtain an even better ranking. Due to the Matthew effect, an early dominant answer will continue to widen the gap with other answers. As shown in Figure 1, the number of votes for the first answer is nearly five times that of the second answer. The Matthew effect affects the ranking of answers, and the ranking shapes users' first impression of a question, which in turn affects the popularity of the question. However, existing approaches do not take this common phenomenon in CQA into account.
  • For the convenience of analyzing information diffusion, with very few exceptions [8,20,21], most existing point process models make specific assumptions about the functional forms of the underlying generative processes. For example, the Hawkes process [22,23,24], a well-known non-homogeneous Poisson process, supposes that the arrival of an event increases the probability of future events, assuming that historical events influence future events through a positive, additive, time-decaying kernel function. However, the true mechanism behind an event cascade is hard to specify or verify in practice, and thus fixed, simple parametric representations may constrain the expressive power of these models.
Accordingly, we propose a Posterior Attention Recurrent Point Process model, abbreviated as PARPP, that takes into account both user interactions and the Matthew effect for predicting question popularity in CQA. Specifically, inspired by the common encoding–decoding framework in language translation [25,26], our PARPP uses long short-term memory (LSTM) to encode the observed history and another LSTM network to record each step of decoding information. At each decoding step, we use prior attention to capture the answers that have a greater impact on the question at that time. When a new answer is observed, we use Bayes' rule to modify the prior attention and obtain the posterior attention, which is then used to update the decoding state. Furthermore, we introduce a convergence strategy in the proposed attention mechanism to capture the Matthew effect; that is, we concentrate attention in the direction of high values to simulate the phenomenon that high-ranking answers continue to accumulate advantages. We have four motivations for applying attention mechanisms to predicting the popularity of CQA questions: (1) not all answers to a question contribute equally to the prediction of its popularity; (2) the popularity of a question can depend on the Matthew effect associated with it; (3) the answers that influence a question's popularity can change over time; and (4) without an attention mechanism to focus on relevant answers, the popularity prediction model may dilute its predictive power by considering too many irrelevant answers. To evaluate the performance of our proposed PARPP model, we conduct experiments on a Zhihu dataset and aim at answering these research questions: Given a question and the trace of its observed answers, when will the next answer appear? How many answers will be posted within a particular period? Can the proposed attention mechanism capture the most influential answers?
Our contributions can be summarized as the following:
  • We propose a Posterior Attention Recurrent Point Process Model, PARPP, to predict the popularity of questions in CQA. Our PARPP is a new point process model that takes into account the impact of user interaction in the modeling of the point process.
  • In CQA, users’ interactions will affect the ranking of the answers and thus the popularity of the question. Our proposed PARPP fully utilizes users’ interactions by a posterior attention mechanism.
  • We explore the Matthew effect in CQA, whereby a highly praised answer accumulates advantages and gains more attention from users. To our knowledge, we are the first to study the Matthew effect in CQA.
  • To model the Matthew effect, we further introduce a convergence strategy in the proposed attention-based model that yields better prediction performance.
  • Our proposed model, PARPP, yields strong results compared to existing state-of-the-art methods on two cascade prediction tasks. The analysis of the experimental results shows that the learned alignments in our model can accurately capture the changes in users' attention to different answers over time.
The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 details the proposed Posterior Attention Recurrent Point Process model (PARPP), beginning with the formulation of the research problem in Section 3.1. Section 4 describes our experimental setup. Section 5 discusses our experimental results. Finally, Section 6 concludes the paper.

2. Related Work

In this section, we discuss five lines of the most related work: Community Question Answering, cascade prediction, point process, attention mechanism, and Matthew effect.

2.1. Community Question Answering

Community Question Answering (CQA) has seen a spectacular increase in popularity recently. With the advent and popularity of sites like Zhihu, Quora, and Stack Overflow, an increasing number of people now use these web forums to seek answers to their questions and share their views. Like other online social networking sites, users in CQA can interact with each other. Users can not only post relevant (or irrelevant) answers based on their opinions/expertise but also upvote or downvote answers from other users based on the validity, significance and content of the responses.
With the explosive growth in the number of questions and answers, how to find the most useful information for users has become a challenge. A number of related algorithms have been proposed, which can be divided into two categories. (i) The first is question retrieval, which aims to find the existing questions that are semantically equivalent or relevant to the queried questions from users. Specifically, Zhou et al. [27] proposed a way to use the semantic relations extracted from the world knowledge of Wikipedia in order to enhance the question similarity in the concept space. Zhou et al. [28] proposed a framework using Fisher kernel to learn continuous word embeddings with metadata of category information within CQA pages for question retrieval. In [29], a convolutional neural tensor network architecture was proposed to encode the sentences in semantic space and model their interactions with a tensor layer. (ii) The second category is answer selection, which aims to pick out good answers from a set of candidate answers. In CQA, a question is often represented as two parts: a subject that summarizes the main points of the question and a body that elaborates on the subject in detail. To make use of the subject–body relationship of community questions, Wu et al. [30] proposed the Question Condensing Networks. Zhou et al. [31] proposed a recurrent convolutional neural network to capture both the semantic matching between question and answer and the semantic correlations embedded in the sequence of answers. There are some other studies that focus on predicting question quality [32,33] and finding expert users in Community Question Answering [34,35]. More recently, Liska et al. [36] and Zhuang et al. [37] proposed a benchmark for adaptation to new knowledge over time in question-answering models. Via heterogeneous semantic fusion, Wu et al. [7] proposed a community answer recommendation, which combines text semantic and multi-aspect features to enhance answer recommendations.
Although research on CQA has been ongoing for some time, to our knowledge, there are very few existing studies in the literature on question popularity prediction. Existing methods simply perform a qualitative analysis of a question's popularity [2,3,6,38]. Specifically, Quan et al. [38] used three popularity-related features of questions (potential hits, popular terms and tedious unpopular terms) in their proposed framework to identify popular questions. Liu et al. [6] proposed a supervised machine learning approach to predict question popularity by modeling content-related, user behavior and user profile features. However, these methods only support simple qualitative analysis, i.e., judging whether a question is popular. In practical applications, we often need more fine-grained predictions: for example, given a question and its existing answers, when will the next new answer appear? How many new answers will appear? This is the focus of this paper.

2.2. Cascade Prediction

Predicting the popularity of questions in CQA can be seen as a cascade prediction problem, and research on this kind of problem is a rich and active field [39]. Recent models for predicting the size of an information cascade are generally characterized by two types of approaches: feature-based methods and point process-based methods. We briefly introduce the related work on feature-based methods here; the related work on point process methods is discussed in the next subsection.
Feature-based methods first construct an exhaustive list of potentially relevant features that contribute to the growth and spreading of cascades, such as content features, original poster/resharer features, structural features, and temporal features [38,40]. Then, different learning or statistical methods are applied to classify the cascades and predict their future size, including simple regression models [40], support vector machine (SVM) [41], probabilistic collaborative filtering [42], naive Bayes [43], and multi-scale decomposition [44]. Recent studies on cascade prediction include the use of hierarchical attention neural networks [45], feature-based models [39], and post hoc deferral mechanisms [46].
There are some drawbacks to such approaches: the features used for training need to be extracted manually, and it is difficult to confirm the quality of the selected features, which makes the performance of the models very sensitive to the chosen features [47]. Instead, a point process models the formation of an information cascade in a network directly, avoiding expensive and cumbersome feature extraction. Thus, in this work, we choose to use a point process to predict question popularity.

2.3. Point Process

Point processes [48] are collections of random points falling in some space whose values are “point patterns” such as time and location. Point processes provide statistical language to describe the time and properties of events and have been applied in many fields. In geophysics, the occurrences of earthquakes and aftershocks can be considered to be point processes, and suitable modeling of the conditional intensity function of a point process is useful for the investigation of various statistical features of seismic activity [49,50,51]. In biology, a spatio-temporal point process model is used to understand and predict a species’ distribution [52]. In finance, financial time-series data, such as the transactions of the stock, can be effectively modeled by point processes [53,54]. In the analysis of online social media, Zhao et al. [55] build on the theory of self-exciting point processes to develop a statistical model that allows us to make an accurate prediction of tweet popularity. Kobayashi and Lambiotte [56] take into account the circadian nature of the users and the aging of information and propose a time-dependent Hawkes process for predicting retweet dynamics.
For the convenience of analyzing discrete events in continuous time ("event streams"), typical point process models make specific assumptions about the functional forms of the underlying generative processes. For instance, the standard Poisson process [57], the simplest point process, assumes that events occur independently of each other. The Hawkes process [23], a well-known non-homogeneous point process, supposes that the arrival of an event increases the probability of future events, assuming that historical events influence future events through a positive, additive, time-decaying kernel function. On the contrary, a self-correcting point process [58] assumes that the occurrence of past points inhibits the occurrence of future points. However, real-world patterns often violate these assumptions. For instance, in CQA, a controversial answer may inspire more answers to refute it, while for a good answer, people may prefer to upvote it rather than write a new one. In this case, it is difficult to express the influence of an answer on subsequent events with a fixed function. Although traditional point process methods have achieved certain results in many fields, they rely too heavily on the hypothesized underlying diffusion model, which greatly limits their expressive power.
Recent studies have attempted to extend the expressive power of temporal point processes with deep learning techniques. Dheur et al. [59] propose RMTPP, a temporal point process that uses a recurrent neural network (RNN) to embed an event history into a vector. Wang et al. [8] explore the cross-dependence problem when modeling cascade dynamics with RNNs and introduce an attention mechanism to resolve the cross-dependence in a cascade. In [21], a competing recurrent point process is proposed for modeling visibility dynamics in information diffusion. Zuo et al. [22] propose a differentially private estimation method for the Hawkes process. Borrajo et al. [60] define a goodness-of-fit test to check the adequacy of a point process model via nonparametric techniques. Kındap and Godsill [61] present novel point process simulation methods based on subordination with a generalized inverse Gaussian process, using a generalized shot-noise representation that involves the random thinning of infinite series of decreasing jump sizes.
However, none of these methods take into account the impact of user interaction on event streams. Especially in CQA, users' behavior gives rise to an obvious Matthew effect. To the best of our knowledge, we are the first to introduce the Matthew effect into a point process model.

2.4. Attention Mechanism

The attention mechanism first emerged as an improvement over encoder–decoder-based neural machine translation systems in natural language processing (NLP). Typically, neural machine translation is based on encoder–decoder recurrent neural networks (RNNs) or long short-term memory (LSTM) networks [62]. Usually, there are two RNNs/LSTMs: one, called the encoder, processes the entire input sentence and encodes it into a context vector; the other, called the decoder, uses the context vector as input and produces the words of the output sentence one after another. However, RNNs cannot remember longer sentences and sequences due to the vanishing/exploding gradient problem, and the performance of the encoder–decoder network degrades rapidly as the length of the input sentence increases [63]. Another problem is that there is no way to give some input words more importance than others while translating the sentence. In order to address these shortcomings, Bahdanau et al. [64] proposed the first attention model, suggesting that not only should all the input words be taken into account in the context vector but relative importance should also be given to each of them. This idea is called "attention". A more common interpretation is that attention indicates which inputs are relevant to each output.
Although the attention mechanism originated in natural language translation, it can be applied to most sequence-to-sequence modeling tasks, such as dialogue systems, abstract generation, and point processes. There are many variations of attention mechanisms. Soft attention was proposed for translation in [64] and refined further in [65,66]. It computes the attention for each output as a multinomial distribution over the input states and uses the resulting probabilities to compute a weighted sum of the input vectors. The weighted value serves as the relevant context for the output and participates in subsequent calculations. Because soft attention is end-to-end differentiable and easy to implement, it has become the most widely used attention mechanism in sequence-to-sequence learning. Hard attention is another attention mechanism, proposed in [67]. Unlike soft attention, hard attention determines the output from a single input rather than an average of all inputs. An important drawback of hard attention is that it cannot be guided; another is that it is not differentiable. Some work [68,69] has tried to solve these problems. In addition, there are other attention mechanisms. Luong et al. [65] attempt to bridge the gap between soft and hard attention and propose local attention, which averages over a window of inputs. Yang et al. [70] consider the relationship between the attentions at different time steps and propose a recurrent history mechanism. Some works model attention as latent alignments [71,72]. Because traditional attention mechanisms can only be computed serially, Vaswani et al. [73] propose a self-attention mechanism to increase concurrency and improve computational efficiency. Fazil et al. [74] propose an attention-based neural network model for socialbot detection. Mazzia et al. [75] propose the action transformer, a self-attention model for short-term pose-based human action recognition. Shen et al. [66] propose an attention model to convert a continuous sign language video clip into spoken language. Shaik et al. [76] propose an adaptive unlearning algorithm that uses an attention mechanism to adapt to changing data distributions and participant characteristics in single-modality and multimodality scenarios.
Prevalent attention architectures do not know the actual output when computing attention, and are therefore assumed to be unable to adequately model the dependence between the attention and the output [77]. In order to take the output into account for more accurate alignment, Shankar and Sarawagi [77] propose posterior attention models that yield better predictions and alignment accuracy. It is worth mentioning that our work is inspired by theirs.

2.5. Matthew Effect

The term "Matthew effect" was coined by the sociologist Robert K. Merton [11], who took the name from the parable of the talents (or minas) in the biblical Gospel of Matthew. Merton found that eminent scientists often receive more credit than comparatively unknown researchers even if their work is similar; it also means that credit will usually be given to researchers who are already famous. Merton further argued that in the scientific community, the Matthew effect reaches beyond simple reputation to influence the wider communication system, playing a part in social selection processes and resulting in a concentration of resources and talent. Because of this, the Matthew effect has a more intuitive expression: "the rich get richer while the poor get poorer".
The Matthew effect can be observed in many aspects of life and fields of activities. For instance, the wealth share owned by the top 1.0% of the richest families in the US has continued increasing over the years [14]. A study of science funding in the Netherlands discovered that winners just above the funding threshold were found to accumulate more than twice as much funding during the subsequent eight years as non-winners with near-identical review scores that fell just below the threshold [15]. In almost all European countries, the Matthew effect makes disadvantaged children less likely to use childcare than more advantaged children [16]. In social commerce, a disproportionately higher percentage of votes went to early-posted lengthy reviews due to the Matthew effect [17]. In education, a Matthew effect phenomenon was observed in which early success in acquiring reading skills usually leads to later successes in reading as the learner grows, while failing to learn to read before the third or fourth year of schooling may be indicative of lifelong problems in learning new skills [78]. Gómez-Bengoechea and Jung [79] explore the relationship between information and communication technologies diffusion and labor productivity at the firm level, and they found a Matthew effect on firms’ digitalization distributional effects.
However, as far as we know, little research has paid attention to analyzing the Matthew effect in CQA. In fact, we also observed the existence of the Matthew effect in CQA: an early, high-quality answer accumulates its advantage once it occupies a top-ranked position, which makes it difficult for a better but later answer to attract users' attention. We provide a more detailed analysis in later sections.

3. Model

In this section, we first define our task and then detail our proposed Posterior Attention Recurrent Point Process model, PARPP, for predicting question popularity in CQA. We begin with the definition of the task, subsequently review the temporal point process model, and then describe our attention-based recurrent point process model.

3.1. Problem Formulation

The problem we aim to address is to predict the popularity of a question based on its observed history. The input data are a collection of M question-answering cascades \mathcal{C} = \{S_i\}_{i=1}^{M}, with each cascade S_i = \{(t_k^i, m_k^i) \mid t_k^i \in [0, +\infty),\ k = 1, \dots, N_i\} being a sequence of answering behaviors for a question, where t_k^i is the time at which an answer is posted, m_k^i is the marker of the answer, and N_i is the total number of answers to the question that have appeared so far. In CQA, the marker includes the textual content associated with an answer and the number of the poster's followers. Let the history H_k be the list of observed events up to the k-th answer and N_{t_k} be the number of answers up to time t_k. The objective of our model is to predict the time t_{k+1} of the next answer and the number of answers N_{t_k+\Delta t} at the future time t_k+\Delta t given a sequence of past answers H_k, which can be formally defined as follows:
  • Predicting Questions’ Popularities in CQA. Given a question-answering history H k , the goal is to seek a function f that aims at predicting the time of the next answer and predicting the number of answers at the future time t k + Δ t :
    H k = { ( t 1 , m 1 ) , , ( t k , m k ) } f t k + 1 , N t k + Δ t ,
    where Δ t is the window size we want to predict. Table 1 summarizes our main notations used across the paper.
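To make the formulation concrete, the following minimal Python sketch shows the cascade data structure and the prediction interface implied by this definition; the class and function names are hypothetical and purely illustrative, not part of our implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Answer:
    t: float      # posting time t_k (e.g., hours since the question was asked)
    marker: dict  # marker m_k: answer text and the poster's follower count

# A question-answering cascade S_i; the observed history H_k is its first k answers.
Cascade = List[Answer]

def predict(history: Cascade, delta_t: float) -> Tuple[float, int]:
    """Hypothetical task interface: given H_k and a window size delta_t, return
    (t_{k+1}, N_{t_k + delta_t}) -- the time of the next answer and the number of
    answers expected by t_k + delta_t."""
    raise NotImplementedError  # filled in by a concrete model such as PARPP
```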

3.2. Temporal Point Process

The temporal point process is a powerful stochastic model to capture event streams. One way to characterize a temporal point process is to specify the distribution function of the next arrival time conditional on the past. Given the history up to the last event, H_k, the conditional cumulative distribution function F(t | H_k) is
F(t \mid H_k) = \int_{t_k}^{t} P\big(t_{k+1} \in [s, s+ds) \mid H_k\big) = \int_{t_k}^{t} f(s \mid H_k)\, ds,   (1)
where f(s | H_k) is the conditional probability density function of the next arrival time t_{k+1}. The joint probability of an event cascade t_1, t_2, ..., t_k is then obtained by the chain rule as
f(t_1, t_2, \dots, t_k) = \prod_{i=1}^{k} f(t_i \mid H_{i-1}).   (2)
However, it is difficult to work directly with the conditional arrival distribution f(t | H_k). Instead, we use the conditional intensity function to characterize a temporal point process. Let N_t count the number of events up to time t; then, the conditional intensity function is
\lambda(t \mid H_k) = \lim_{h \to 0} \frac{P\{N_{t+h} - N_{t} = 1 \mid H_k\}}{h}.   (3)
It turns out that the conditional intensity function \lambda(t | H_k) in the above equation can be expressed in terms of the conditional density f and its corresponding conditional cumulative distribution function F:
\lambda(t \mid H_k) = \frac{f(t \mid H_k)}{1 - F(t \mid H_k)}.   (4)
As a consequence, the conditional density function can alternatively be specified by
f(t \mid H_k) = \lambda(t \mid H_k) \exp\!\left(-\int_{t_k}^{t} \lambda(\tau \mid H_k)\, d\tau\right).   (5)
Then, we can estimate the time of the next arriving event, \tilde{t}_{k+1}, using the following expectation:
\tilde{t}_{k+1} = \int_{t_k}^{+\infty} t \cdot f(t \mid H_k)\, dt.   (6)
In general, the integral in Equation (6) does not have an analytic solution. In order to obtain the estimate of t_{k+1}, we instead apply commonly used numerical integration techniques [80] to evaluate Equation (6). The conditional intensity function \lambda(t | H_k) can be seen as the expected rate of arrivals conditioned on H_k, so we can predict the future event count \tilde{N}_{t_k+\Delta t} as
\tilde{N}_{t_k+\Delta t} - N_{t_k} = \int_{t_k}^{t_k+\Delta t} \lambda(s \mid H_k)\, ds.   (7)
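As a concrete illustration of Equations (5)–(7), the sketch below numerically estimates the next arrival time and the expected answer count over a window for an assumed exponentially decaying intensity \lambda(t) = \lambda_0 e^{-\alpha(t - t_k)} (the form later used in Equation (12)); the parameter values are arbitrary, and simple Riemann sums stand in for the numerical integration techniques [80] mentioned above.

```python
import numpy as np

def next_time_and_count(lambda0, alpha, t_k, delta_t, horizon=500.0, step=0.01):
    """Estimate t_{k+1} via Eq. (6) and N_{t_k+dt} - N_{t_k} via Eq. (7)
    for the intensity lambda(t) = lambda0 * exp(-alpha * (t - t_k))."""
    t = np.arange(t_k, t_k + horizon, step)
    lam = lambda0 * np.exp(-alpha * (t - t_k))
    # Compensator: integral of lambda from t_k to t (the exponent in Eq. (5)).
    Lam = np.concatenate([[0.0], np.cumsum((lam[1:] + lam[:-1]) / 2 * step)])
    f = lam * np.exp(-Lam)                      # conditional density, Eq. (5)
    t_next = float(np.sum(t * f) * step)        # expected next arrival time, Eq. (6)
    mask = t <= t_k + delta_t
    n_future = float(np.sum(lam[mask]) * step)  # expected count in the window, Eq. (7)
    return t_next, n_future

# Example with arbitrary parameters: lambda0 = 0.5 answers/hour, alpha = 0.1.
print(next_time_and_count(lambda0=0.5, alpha=0.1, t_k=0.0, delta_t=24.0))
```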

3.3. Recurrent Point Process

Our model extends the temporal point process by integrating recurrent neural networks (RNNs) in a unique way. An RNN is a neural network with recurrent connections that has been proven to successfully model sequences. For each answer with posting time t_k and marker m_k, we vectorize the pair (t_k, m_k) into x_k as the input to a recurrent neural network in order. After the RNN's nonlinear transformation, we obtain the hidden state h_k = RNN(x_k, h_{k-1}). The embedding h_k can be seen as the representation of the cascade up to the k-th answer, and the output is trained to maximize the likelihood of a CQA cascade,
P(S) = \prod_{k=1}^{N-1} P(t_{k+1} \mid H_k) = \prod_{k=1}^{N-1} f(t_{k+1} \mid h_k),   (8)
where P(S) denotes the likelihood of a question–answer cascade S.

3.4. Attention Mechanism

In CQA, a question may have dozens or even hundreds of answers. Users with limited time would only be attracted by a small number of high-quality answers. Although these answers are often not the latest posts, they occupy the optimal sort position. Intuitively, we believe that these answers will have a greater impact on the popularity of the question. However, due to the chain structure of an RNN, it fails to effectively capture the impact of long history on the present. The attention mechanism is an effective way to solve this problem. The attention mechanism was originally used in the encoder–decoder-based neural machine translation system [64]. It is proposed as an improvement to solve the vanishing/exploding gradient problems and obtain a better alignment between output and input. We refer to the practice in natural language translation and use the attention mechanism in our model to find answers that have the greatest impact on the popularity of the question.
In existing encoder–decoder models, a general way of applying the attention mechanism is to calculate a context vector c_k:
c_k = \sum_{i=1}^{k} a_{k,i}\, h_i, \quad \text{s.t.} \quad \sum_{i=1}^{k} a_{k,i} = 1,   (9)
where the weight of each attention variable a_{k,i} is computed as a function of the decoder state and the encoder state, a_{k,i} \propto e^{A_\theta(h_i, s_k)}. Here, A_\theta(\cdot,\cdot) is an end-to-end trained function of an input embedding h_i and a decoder state s_k.
However, the treatment of the attention variables has been rather ad hoc. For instance, in natural language translation, attention is interpreted as the alignment between source words and target words. Only when we observe a target word can we accurately know which source words it is aligned with. Therefore, in order to better capture the impact of historical events on the present, we propose a Posterior Attention Recurrent Point Process model, PARPP, for CQA popularity prediction.
The framework of the proposed PARPP is shown in Figure 2. The input is a sequence of time–marker pairs of answers, i.e., (t_1, m_1), ..., (t_k, m_k), and the output is the predicted timestamp at which the next answer will appear. The proposed PARPP encodes the input sequence into sequential embeddings h_1, h_2, ..., h_k via an encoder. Each embedding here is the representation of an answer as a whole rather than of a word within an answer. To predict the timestamp of the next answer, our model utilizes two attention layers, i.e., the "posterior attention" and "prior attention" layers, which are the two main components of the decoder. In the decoder, the model takes the inferred embeddings h_1, h_2, ..., h_k into the posterior attention layer and outputs the decoder state vector s_k. In the prior attention layer, the input consists of a sequence of candidate timestamps t_{k+1}^1, t_{k+1}^2, ..., t_{k+1}^k that serve the prediction of the timestamp at which the next answer appears. Finally, the decoder predicts the timestamp t_{k+1} of the next answer by weighting all the candidate timestamps t_{k+1}^1, t_{k+1}^2, ..., t_{k+1}^k via the prior attention layer. We first factorize the conditional probability of the next answering behavior by marginalizing over the attention variable a_k as
P(t_{k+1} \mid H_k) = \sum_{a_k=1}^{k} P(t_{k+1}, a_k \mid H_k) = \sum_{a_k=1}^{k} P(t_{k+1} \mid a_k, H_k)\, P(a_k \mid H_k) = \sum_{a_k=1}^{k} P(t_{k+1} \mid a_k, H_k)\, Prior_k(a_k),   (10)
where we denote P(a_k | H_k) as Prior_k(a_k) and regard it as the prior attention because it is the attention distribution before observing the next answer. According to the framework, we represent the conditional probability as
P(t_{k+1} \mid a_k, H_k) = f(t; s_k, h_{a_k}) = \lambda(t) \exp\!\left(-\int_{t_k}^{t} \lambda(\tau)\, d\tau\right).   (11)
With h_{a_k} and the decoder state s_k, we formulate the conditional intensity function as
\lambda(t) = \lambda_0\, e^{-\alpha (t - t_k)}, \quad \text{s.t.} \quad \lambda_0 = \exp(w_\lambda^T s_k + u_\lambda^T h_{a_k} + b_\lambda), \quad \alpha = \exp(w_\alpha^T s_k + u_\alpha^T h_{a_k} + b_\alpha),   (12)
where the column vectors w_\lambda, u_\lambda, w_\alpha, u_\alpha and the scalars b_\lambda, b_\alpha are parameters of the model, \lambda_0 is the initial value of \lambda(t), and \alpha is the decay rate. In order to avoid the explosion of \lambda, we constrain \alpha to be greater than zero.
When we observe the next answer, we can calculate the posterior distribution of the attention. Analogously, we refer to P(a_k | H_{k+1}) = P(a_k | H_k, (t_{k+1}, m_{k+1})) as the posterior attention Postr_k(a_k), since, unlike the prior attention, it is the attention distribution after observing the next answer at the corresponding step.
We can obtain Postr_k(a_k) by applying Bayes' rule as follows:
Postr_k(a_k) = P(a_k \mid H_{k+1}) = \frac{P(t_{k+1} \mid a_k, H_k)\, P(a_k \mid H_k)}{P(t_{k+1} \mid H_k)} = \frac{P(t_{k+1} \mid a_k, H_k)\, Prior_k(a_k)}{\sum_{a_k=1}^{k} P(t_{k+1} \mid a_k, H_k)\, Prior_k(a_k)}.   (13)
We expect this attention to be more accurate as it is computed with knowledge of the posting time of the next answer. We then use the posterior attention Postr_k(a_k) to update the decoding state s_{k+1} by
s_{k+1} = RNN\!\left(s_k,\ \sum_{a_k=1}^{k} Postr_k(a_k)\, h_{a_k}\right).   (14)
Next, we calculate the prior attention at step k+1. Note that at step k+1, we have observed one more event than before, which changes the range of the attention variable, i.e., a_k \in \{1, \dots, k\} while a_{k+1} \in \{1, \dots, k, k+1\}. Intuitively, we assume that the larger k is, the more similar the attention distributions at steps k and k+1 are: the greater the number of answers, the more people's interest is allocated to a few high-quality answers, and when a new answer is overwhelmed by other answers, it is difficult for it to attract people's attention. Based on this observation, we compute Prior_{k+1}(a_{k+1}) by
Prior_{k+1}(a) = \begin{cases} Postr_k(a)\left(1 - \frac{1}{k+1}\right), & a = 1, \dots, k \\ \frac{1}{k+1}, & a = k+1. \end{cases}   (15)
As k increases, the attention gained by the latest answer becomes smaller and smaller, and the attention distribution over the old answers becomes more and more similar to that of the previous step. The example in Figure 3a illustrates this idea. Since the new answer is ranked near the bottom, it is difficult for it to attract users' attention, and thus the attention distribution at this time should be similar to that at the previous moment t_k.
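To illustrate the mechanics, the following minimal sketch (with invented numbers) walks through one step of the Bayes update of Equation (13) and the prior propagation of Equation (15); the likelihood values are stand-ins for the intensity-based densities of Equations (11) and (12).

```python
import numpy as np

def posterior_attention(prior, likelihood):
    """Bayes update of Eq. (13): Postr_k(a) is proportional to
    P(t_{k+1} | a, H_k) * Prior_k(a), normalized over a = 1..k."""
    post = prior * likelihood
    return post / post.sum()

def propagate_prior(posterior):
    """Prior for step k+1 (Eq. (15)): shrink the old attentions by (1 - 1/(k+1))
    and give the newly observed answer the remaining mass 1/(k+1)."""
    k = len(posterior)
    return np.append(posterior * (1.0 - 1.0 / (k + 1)), 1.0 / (k + 1))

# Illustrative step with k = 3 observed answers (all values are placeholders).
prior = np.array([0.5, 0.3, 0.2])              # Prior_k(a), a = 1..k
likelihood = np.array([0.8, 0.1, 0.1])         # P(t_{k+1} | a, H_k)
post = posterior_attention(prior, likelihood)  # Postr_k after observing answer k+1
print(post, propagate_prior(post))             # posterior and Prior_{k+1}
```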

3.5. Convergence Strategy

In CQA, users are usually attracted by a small number of high-quality answers, and because of the Matthew effect, users' attention becomes more and more concentrated. To capture the Matthew effect in CQA, we propose a convergence strategy in our PARPP to adjust the posterior attention mechanism. The idea is depicted in Figure 3b. Due to the Matthew effect, a popular answer will keep accumulating popularity and form a gap with the other answers. As time progresses, the votes for the top-ranked answers will far exceed those for the other answers. In the extreme case, the user's attention will be focused entirely on the top-ranked answer. The upper right of Figure 3b shows the distribution of attention in this case. Formally, let P_{t_k}(a) denote the distribution of attention when there are k answers, and assume that the attention assigned to the first answer is the greatest. To simplify the exposition, we assume that only the answer ranked first continues to accumulate advantages and that its gap with the remaining answers keeps growing. Then, we expect
\lim_{\Delta t \to +\infty} P_{t_k+\Delta t}(a) = P_{+}(a) = \begin{cases} 1, & a = 1 \\ 0, & a = 2, \dots, k. \end{cases}   (16)
At time t_k+\Delta t, we expect the attention distribution to converge toward P_+. Inspired by this, we calculate the attention distribution at time t_k+\Delta t by
P_{t_k+\Delta t} = \arg\min_{P_{t_k+\Delta t}} \; (1-\delta)\, D_{KL}\big(P_{t_k} \,\|\, P_{t_k+\Delta t}\big) + \delta\, D_{KL}\big(P_{+} \,\|\, P_{t_k+\Delta t}\big),   (17)
where D_{KL} is the Kullback–Leibler divergence and \delta \in (0, 1) is a function of \Delta t that increases with \Delta t. \delta can be regarded as the degree of convergence of P_{t_k} toward P_+: the larger \Delta t is, the larger \delta is, and the closer P_{t_k+\Delta t} is to P_+. The optimization problem in Formulation (17) can be solved using the method of Lagrange multipliers. Then, we have
P_{t_k+\Delta t}(a) = \begin{cases} (1-\delta)\, P_{t_k}(a) + \delta, & a = 1 \\ (1-\delta)\, P_{t_k}(a), & a = 2, \dots, k. \end{cases}   (18)
We use this convergence strategy in the attention mechanism of our proposed model. For convenience of description, as before, we assume that the attention assigned to the first answer in the posterior of the previous step is the greatest. Then, we change the calculation of prior attention in step k + 1 by
Prior_{k+1}(a) = \begin{cases} \big((1-\delta)\, Postr_k(a) + \delta\big)\left(1 - \frac{1}{k+1}\right), & a = 1 \\ (1-\delta)\, Postr_k(a)\left(1 - \frac{1}{k+1}\right), & a = 2, \dots, k \\ \frac{1}{k+1}, & a = k+1, \end{cases}   (19)
where we let \delta = 1 - e^{-\theta\, dt}, with dt = t_{k+1} - t_k, and \theta is a positive parameter learned by the model.
The convergence strategy can also be extended to converge toward the top-k attention variables. For example, suppose the two highest attentions at time t_k are P_k(a=1) = \mu and P_k(a=2) = \nu; to converge toward the two highest attentions, we can set
P_{+}(a) = \begin{cases} \frac{\mu}{\mu+\nu}, & a = 1 \\ \frac{\nu}{\mu+\nu}, & a = 2 \\ 0, & a = 3, \dots, k. \end{cases}   (20)
The rest of the calculation of P_{k+1}(a) is similar to Equations (17)–(19). We denote PARPP with the convergence strategy as PARPP-CON.
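The following sketch illustrates the convergence-adjusted prior of Equation (19) in its top-1 variant; \theta is a parameter learned by the model, so the value used here, like the attention values, is purely illustrative.

```python
import numpy as np

def converged_prior(posterior, dt, theta=0.5):
    """Prior_{k+1} with the convergence strategy (Eq. (19), top-1 variant):
    interpolate the previous posterior toward P_plus, which puts all mass on the
    currently highest-attended answer, with delta = 1 - exp(-theta * dt)."""
    delta = 1.0 - np.exp(-theta * dt)
    p = (1.0 - delta) * posterior
    p[np.argmax(posterior)] += delta            # pull mass toward the top answer
    k = len(p)
    return np.append(p * (1.0 - 1.0 / (k + 1)), 1.0 / (k + 1))

# Illustrative call: the longer the gap dt to the next answer, the more the
# attention concentrates on the leading answer before the new answer's share is added.
print(converged_prior(np.array([0.6, 0.3, 0.1]), dt=2.0))
```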

3.6. Parameter Learning

In this section, we introduce the learning process of our proposed models. Given a collection of QA sequences \mathcal{C} = \{S_i\}_{i=1}^{M}, where S_i = \{(t_j^i, m_j^i)\}_{j=1}^{N_i}, we assume that the question–answer cascades are independent of each other. As a result, the log-likelihood of the QA set \mathcal{C} is the sum of the log-likelihoods of the individual cascades. We can then train the model by minimizing the joint negative log-likelihood L(\mathcal{C}):
L(\mathcal{C}) = L(\{S_i\}) = -\sum_{i=1}^{M} \sum_{k=1}^{N_i-1} \log P(t_{k+1}^i \mid H_k^i) = -\sum_{i=1}^{M} \sum_{k=1}^{N_i-1} \log \sum_{a=1}^{k} P(t_{k+1}^i \mid a, H_k^i)\, Prior_k(a).   (21)
We exploit Back Propagation Through Time (BPTT) to train PARPP. In each training iteration, we vectorize the question as the first input and then its answering behaviors. Each input includes not only temporal features but also marker features such as the number of followers of the user posting each answer and the embedding of the text information. The temporal features consist of the logarithmic time interval log(t_k - t_{k-1}) and the discretization of the timestamp into month, day, hour, minute and second. To encode the text information, we use bert-as-service [81], a toolkit built on BERT [82], the well-known natural language processing (NLP) model developed by Google for pre-training language representations, to map sentences into fixed-length representations. We then feed the sentence representations into an RNN to obtain the text representation of an answer. For better results, we adopt LSTM [62], a modern variant of the RNN, in our model. We then apply stochastic gradient descent (SGD) with mini-batches and update the parameters using the Adam optimizer [83]. We also employ early stopping [84] to prevent overfitting.
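For illustration, the inner sum of the objective in Equation (21) can be computed for a single cascade as in the sketch below, assuming the decoder has already produced the prior attentions Prior_k(a) and the conditional densities P(t_{k+1} | a, H_k) for each step; the numbers are placeholders, and the sketch omits backpropagation and the optimizer.

```python
import numpy as np

def cascade_nll(priors, likelihoods):
    """Negative log-likelihood of one cascade (inner sum of Eq. (21)).
    priors[k] and likelihoods[k] are length-(k+1) arrays holding Prior_k(a)
    and P(t_{k+1} | a, H_k) for a = 1..k, as produced by the decoder."""
    nll = 0.0
    for prior, lik in zip(priors, likelihoods):
        nll -= np.log(np.sum(prior * lik))  # log P(t_{k+1} | H_k), via Eq. (10)
    return nll

# Toy cascade with two prediction steps (values are placeholders).
priors = [np.array([1.0]), np.array([0.7, 0.3])]
likelihoods = [np.array([0.2]), np.array([0.5, 0.1])]
print(cascade_nll(priors, likelihoods))
```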

4. Experimental Setup

In this section, we describe the experimental dataset crawled from Zhihu, a famous Chinese CQA platform. We then describe the state-of-the-art cascade prediction methods to which we compare our PARPP model. Finally, we describe how we evaluate the proposed attention mechanism.

4.1. Research Questions

The research questions guiding the remainder of the paper are outlined below:
  • (RQ1) Given a question and the trace of observed answers, when will the next answer appear?
  • (RQ2) Given a question and the trace of observed answers, how many answers will be posted in the future?
  • (RQ3) Can the proposed attention mechanism capture the most influential answers?

4.2. Dataset

Our experiments are conducted on a real-world dataset crawled from Zhihu, a famous Chinese CQA platform. In order to make the data more representative, we use a Zhihu API (publicly available at https://syaning.github.io/zhihuapi-py/, accessed on 1 June 2024) to crawl data from 25 distinguished topics on Zhihu, with up to 700 questions under each topic. In the end, we crawled 8272 questions published before 7 September 2019 (with no restrictions on their posting timestamps) and the corresponding answers, which number more than 1.3 million. These questions were posted anywhere from two to three days to a few years before the crawl. For each question and answer, the dataset includes the question/answer ID, the timestamps, the number of votes for the question, the number of followers of the poster, the text content and the comments. Table 2 shows the statistics of some topics in the dataset. To illustrate the structure of the data, Figure 4 shows an example question and its corresponding answers with timestamps.
In this paper, we focus on questions that were submitted more than half a year before the crawl and received more than 50 answers within half a year. We remove answers that appeared more than six months after the question was posted, because if the time is too far away, it is no longer meaningful to predict the popularity of the question. Eventually, we obtained 1888 questions with more than 840,000 answers. Details of the dataset are shown in Figure 5. Figure 5a shows the empirical distribution of the inter-event times, whose mean value is 25.15 h. For visualization convenience, we take the logarithm of the time interval. Figure 5b shows the histogram of the answers' posting times. From this figure, we can see that most of the answers appeared within 10 days after the question was asked, and the overall trend conforms to an exponential decay. Therefore, in Equation (12), our conditional intensity function uses an exponential form. Finally, Figure 5c displays the evolution of the number of answers over 90 days. We take the number of answers within half a year after the question is raised as the final number and mark it with dashed lines in the figure. On average, the final number of answers to a question in half a year is ∼318, and a question receives nearly 80% of its answers in the first 90 days.
We divide the dataset into a training set and a test set at a ratio of 9:1. We feed the answers in each question-answer cascade into the model in turn and predict when the next answer will appear and the number of answers at a specific point in time.

4.3. Matthew Effect in CQA

To examine the existence and strength of the Matthew effect in CQA, we ranked all answers to each question based on the number of votes they received. Then, we calculated the percentage of votes received by the top-ranked 10% and 20% of answers, respectively, against the total votes received by all answers. We also calculated the percentage of votes received by the earliest-posted 10% and 20% of answers (based on their posting dates), respectively, against the total votes received by all answers. We randomly selected 10 topics from the Zhihu dataset and report the number of questions and the average percentages of total votes for each of the ten topics in Table 3.
We found that the top 10% of answers received an average of 82.1% of the total votes, while the top 20% of answers received an average of 90.1% of the total votes. In comparison, the earliest 10% of answers received an average of 29.6% of the total votes, and the earliest 20% of answers received an average of 43.3% of the total votes. From these data, we found that a disproportionately high percentage of votes went to early-posted answers. This shows that an early-posted quality answer accumulates advantages over time, which is evidence of the existence of the Matthew effect in CQA.
Figure 6 provides another perspective on the Matthew effect in CQA. A direct result of the Matthew effect is the concentration of resources. Figure 6 shows the histogram of the number of votes obtained by answers. Figure 6a shows the average proportion of votes received by the ten most popular answers to each question. Surprisingly, although the most popular answer is just a single answer, it receives more than 40% of the votes for all the answers to a question. The distribution of the total number of votes is similar to a power-law distribution, which is a typical distribution formed by the concentration of resources caused by the Matthew effect. Figure 6b shows the votes received by the top-k% of answers. We can see that, on average, 5% of the answers received more than 80% of the votes, and the top 10% of answers received almost all of the votes for a question. All these data indicate the existence and significant influence of the Matthew effect in CQA.
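The vote-concentration statistics reported above can be computed per question with a short routine of the following form; the vote counts in the example are invented for illustration.

```python
def top_fraction_vote_share(votes, fraction=0.1):
    """Share of total votes received by the top `fraction` of answers to one
    question, with answers ranked by vote count in descending order."""
    ranked = sorted(votes, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total > 0 else 0.0

# Example: one question whose most popular answer dominates the votes.
votes = [4100, 850, 300, 120, 40, 25, 10, 5, 3, 1]
print(top_fraction_vote_share(votes, 0.1), top_fraction_vote_share(votes, 0.2))
```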

4.4. Baselines

To evaluate the predictive performance of our proposed model PARPP, we compare it with the following five prediction baselines. Methods (iv) and (v) can model both the next event type and the next activation time; here, we only use them to predict the time of the next event.
  • Naive Prediction. This is a naive baseline that uses the time difference between the last two answers to predict when the next answer will appear.
  • Poisson Process [85]. This is a point process that assumes the conditional intensity function is independent of the history H_k and remains constant over time:
    \lambda(t) = \lambda_0,
    where \lambda_0 is a fixed constant. In fact, the model produces an estimate of the average inter-event gap.
  • Hawkes Process [86]. This is a self-exciting process in which the arrival of an event causes the conditional intensity function to increase. The intensity function is parameterized by
    \lambda(t) = \lambda_0 + \alpha \sum_{t_k < t} \phi(t - t_k),
    where \lambda_0 \geq 0 is the base rate and \phi(\cdot) is the triggering kernel capturing temporal dependencies. In our experiments, we choose \phi(t) = e^{-\beta t} (a minimal sketch of this intensity appears after this list). The parameters \lambda_0, \alpha, and \beta are optimized by maximizing the likelihood function.
  • RMTPP [59]. The recurrent marked temporal point process (RMTPP) is a temporal point process method based on an RNN. The original model predicts not only the time of the next event but also its type; here, we only use it to predict the next time. Specifically, RMTPP applies an RNN to learn a general representation h_k of the nonlinear dependency over both the time and the marker information from past events:
    h_k = \max\{W_m m_k + W_t t_k + W_h h_{k-1},\ 0\},
    where W_m, W_t, and W_h are weight matrices, m_k represents the marker features, and t_k represents the temporal features. Then, the conditional intensity function can be formulated as
    \lambda(t) = \exp\big(v_t^T \cdot h_k + w_t (t - t_k) + b_t\big),
    where v_t is a column vector, and w_t and b_t are scalars.
  • CYAN-RNN [8]. This is an attention-based RNN method for modeling cascade dynamics. Similarly, the conditional intensity function is computed as
    \lambda(t) = \exp\big(v^T \cdot h_k + u^T \cdot s_k + z^T \cdot x_k + w (t - t_k)\big),
    where v, u, z, and w are learned parameters, x_k is the vector of the k-th event (t_k, m_k), and s_k is the decoding state, which is calculated through an attention mechanism by
    s_k = \sigma\!\left(x_k, s_{k-1}, \sum_a P_k(a)\, h_a\right),
    where \sigma is a nonlinear activation function. The attention weight P_k(a) is formalized as
    P_k(a) = \frac{e^{A_\theta(h_a, s_k)}}{\sum_{i=1}^{k} e^{A_\theta(h_i, s_k)}},
    where A_\theta is a non-linear function.
    The authors of [8] further proposed a coverage strategy to adjust the misallocation of attention. We denote CYAN-RNN with coverage as CYAN-RNN(cov).
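As referenced in the Hawkes baseline above, a minimal sketch of its conditional intensity with the exponential kernel \phi(t) = e^{-\beta t} is given below; the parameter values are placeholders, since \lambda_0, \alpha and \beta are fitted by maximum likelihood in the experiments.

```python
import numpy as np

def hawkes_intensity(t, history, lam0=0.1, alpha=0.5, beta=1.0):
    """Conditional intensity of the Hawkes baseline:
    lambda(t) = lam0 + alpha * sum_{t_k < t} exp(-beta * (t - t_k))."""
    past = np.asarray([tk for tk in history if tk < t])
    return lam0 + alpha * np.exp(-beta * (t - past)).sum()

# Example: three observed answer times (in hours); intensity shortly after the last one.
print(hawkes_intensity(t=5.5, history=[1.0, 2.5, 5.0]))
```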

4.5. Evaluation Metrics

The evaluation metrics used to assess the performance of question popularity prediction are those widely used for point processes [8,20,55,56]: root mean square error (RMSE), mean absolute percentage error (MAPE) and mean absolute error per day. RMSE and MAPE_t measure the performance of next-answer time prediction, while MAPE_N and the mean absolute error per day measure the performance of answer-count prediction. At step k, a model outputs the corresponding conditional intensity function \lambda(t | H_k) each time a prediction is made. For predicting the time of the next answer, we use Equation (6) to obtain the prediction \tilde{t}_{k+1}. For predicting the number of answers, we use Equation (7) to obtain the prediction \tilde{N}_{t_k+\Delta t}. These metrics are computed as follows:
  • RMSE. Every time we make a prediction, we obtain an estimate \tilde{t}_{k+1}. The RMSE for time prediction is
    RMSE = \sqrt{\frac{1}{n} \sum_{n} \big(\tilde{t}_{k+1} - t_{k+1}\big)^2},
    where t_{k+1} is the ground-truth time and n is the number of predictions.
  • MAPE. MAPE_t reports the mean deviation of the estimated timestamps of the next future events from their actual times:
    MAPE_t = \frac{100\%}{n} \sum_{n} \frac{|\tilde{t}_{k+1} - t_{k+1}|}{t_{k+1}}.
Additionally, MAPE_N measures the count prediction error, defined as
MAPE_N = \frac{100\%}{n} \sum_{n} \frac{|\tilde{N}_{t_k+\Delta t} - N_{t_k+\Delta t}|}{N_{t_k+\Delta t}}.
  • Mean Absolute Error per Day. This metric measures the impact of the forecast duration on the forecast results and is defined as
    \epsilon = \frac{1}{n\, \Delta t} \sum_{n} |\tilde{N}_{t_k+\Delta t} - N_{t_k+\Delta t}|,
    where N_{t_k+\Delta t} is the ground-truth count.
The evaluation metrics used to assess the alignment of the attention mechanism are P@k (Precision at k) [87] and NDCG@k (Normalized Discounted Cumulative Gain at k) [88], as we rank and select the top-k answers with the highest attention for prediction purposes in the model. We sort the attention weights at each prediction step and select the top-k answers with the highest attention (denoted as the predicted rank list \tilde{R}_k) and compare them with the actual situation, in which answers are sorted by the number of comments (denoted as the ground-truth list R_k). Because each comment has a timestamp, we use the number of comments instead of votes to track the dynamic changes in user attention. The details of the calculations are as follows:
  • Precision. Precision is a widely used metric to judge whether a model can correctly predict the positive samples. We count how many answers are in both the predicted set \tilde{R}_k and the ground-truth set R_k and treat them as true positives. We calculate P@k as
    P@k = \frac{|\tilde{R}_k \cap R_k|}{k}.
  • NDCG. Because P@k does not consider the ranking position, we also use NDCG@k. We denote ans_i as the i-th answer in \tilde{R}_k and define the relevance rel_i of ans_i as
    rel_i = \begin{cases} 1, & ans_i \in R_k \\ 0, & ans_i \notin R_k. \end{cases}
    Then, we obtain the DCG (Discounted Cumulative Gain) as
    DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)},
    and the IDCG (Ideal Discounted Cumulative Gain) as
    IDCG@k = \sum_{i=1}^{k} \frac{1}{\log_2(i+1)}.
    Finally, we calculate NDCG@k as
    NDCG@k = \frac{DCG@k}{IDCG@k}.
Higher P@k and NDCG@k scores, and lower RMSE and MAPE scores, indicate better performance.
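For reference, the alignment metrics P@k and NDCG@k, together with MAPE_t for the time predictions, can be computed as in the following sketch; the toy rankings and timestamps are invented for illustration.

```python
import numpy as np

def precision_at_k(pred_topk, truth_topk):
    """P@k: fraction of the predicted top-k answers that appear in the ground-truth top-k."""
    return len(set(pred_topk) & set(truth_topk)) / len(pred_topk)

def ndcg_at_k(pred_topk, truth_topk):
    """NDCG@k with binary relevance: rel_i = 1 iff the i-th predicted answer is in the truth."""
    truth = set(truth_topk)
    rel = [1.0 if a in truth else 0.0 for a in pred_topk]
    dcg = sum(r / np.log2(i + 2) for i, r in enumerate(rel))
    idcg = sum(1.0 / np.log2(i + 2) for i in range(len(pred_topk)))
    return dcg / idcg

def mape_t(pred_times, true_times):
    """MAPE_t: mean absolute percentage error of the next-answer time predictions."""
    pred, true = np.asarray(pred_times), np.asarray(true_times)
    return 100.0 * np.mean(np.abs(pred - true) / true)

# Toy example: answer IDs ranked by attention vs. by comment count, plus two time predictions.
print(precision_at_k([3, 1, 7], [1, 3, 9]), ndcg_at_k([3, 1, 7], [1, 3, 9]))
print(mape_t([26.0, 50.0], [24.0, 48.0]))
```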

5. Results and Analysis

In this section, we answer the research questions and analyze our experimental results.

5.1. Predict the Time of the Next Answer

RQ1: We first compare the performance of our proposed model (PARPP) and the baselines in predicting the time at which the next answer appears. Table 4 shows the results. PARPP-CON is the same as PARPP except that it uses the convergence strategy. The hyperparameters of PARPP and PARPP-CON are chosen from the following sets: learning rate {0.1, 0.01, 0.001}; hidden layer size of the encoder and decoder {8, 16, 32}; and batch size {8, 16, 32}. We report the best results over these configurations. In order to examine the effect of the proposed convergence strategy, we set up five comparison groups {Top1%, Top5%, Top10%, Top15%, Top20%}, which converge toward the top 1%, 5%, 10%, 15% and 20% of answers, respectively.
It can be seen from Table 4 that, in this prediction task, the RMSE and MAPE_t scores obtained by PARPP and PARPP-CON are lower than those of the other baselines, indicating the validity of our proposed model. Moreover, we observe that PARPP-CON consistently performs better than PARPP, especially in the case of Top15%. Compared to the second-best model CYAN-RNN(cov), PARPP-CON reduces the relative error by 10.60% on RMSE and by 31.7% on MAPE_t, implying that the convergence strategy can better capture the impact of each answer on the question.
An interesting finding in the table is that the Poisson and Hawkes processes achieve RMSE scores close to those of the other methods, yet they perform quite poorly in terms of $\mathrm{MAPE}_t$. We plot the change in $\mathrm{MAPE}_t$ in Figure 7, where the x-axis is the observation time and a time prediction is made every six hours. For brevity, we only plot PARPP-CON with the Top15% convergence setting because it performs best. At early stages, the traditional point-process methods, i.e., the Poisson process and the Hawkes process, find it difficult to produce accurate predictions, probably because their assumptions about the underlying diffusion process do not fit the situation in CQA. In contrast, the models based on recurrent neural networks work well, and our model performs best.
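For readers who wish to reproduce this prediction step, the next-answer time of a temporal point process can be estimated as the expectation of the conditional density $f(t \mid H_k)$ induced by the learned intensity $\lambda$. The sketch below is a generic numerical version of this computation; the function name, the finite integration horizon, and the grid size are assumptions for illustration and not part of the authors' implementation.

```python
import numpy as np

def _trapezoid(y, x):
    """Trapezoidal-rule integral of samples y over grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def expected_next_answer_time(intensity_fn, t_k, horizon_days=60.0, n_grid=5000):
    """Estimate E[t_{k+1}] = integral of t * f(t | H_k) dt, where
    f(t | H_k) = lambda(t) * exp(-integral_{t_k}^{t} lambda(s) ds).
    `intensity_fn(t)` is assumed to return the conditional intensity at time t."""
    ts = np.linspace(t_k, t_k + horizon_days, n_grid)
    lam = np.array([intensity_fn(t) for t in ts])
    # cumulative integral of the intensity (the compensator), trapezoidal rule
    compensator = np.concatenate(
        ([0.0], np.cumsum(0.5 * (lam[1:] + lam[:-1]) * np.diff(ts)))
    )
    density = lam * np.exp(-compensator)  # conditional density of the next arrival
    return _trapezoid(ts * density, ts)

# Example with a toy decaying intensity (answers arrive less often over time)
print(expected_next_answer_time(lambda t: 0.5 * np.exp(-0.1 * t), t_k=2.0))
```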

5.2. Predict the Number of Answers

RQ2: We then test the performance of our model in predicting the number of answers in the future. We set 10 observation time points, starting from the moment when the question is posted, with the observation interval varying from 1 day to 10 days. At each observation point, we feed the observed answers into the model and then use the resulting conditional intensity to predict the number of answers in the future, as sketched below. The prediction window sizes $\Delta_{pred}$ range from 1 to 10 days. Figure 8 shows the prediction results based on the mean absolute error per day for each model, and Figure 9 shows the results based on $\mathrm{MAPE}_N$. Because the plots for three models, i.e., Hawkes, PARPP, and PARPP-CON(Top15%), overlap each other in Figure 8a and are hard to distinguish, we zoom in on these three models in Figure 8b. The same applies to Figure 8c and Figure 9a,c, whose zoomed-in counterparts are Figure 8d and Figure 9b,d, respectively. Because the naive method does not take into account the fact that the popularity of a question declines over time, its prediction performance is poor, and we omit its results from the figures. Similarly, for brevity, we only draw the results of PARPP-CON(Top15%), which performs best in the previous prediction task.
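The expected number of future answers follows from the standard point-process identity $E[N_{t_k+\Delta t} - N_{t_k}] = \int_{t_k}^{t_k+\Delta t} \lambda(t)\,dt$. The following is a minimal sketch of that computation; the function names and the numerical-integration grid are illustrative assumptions, not the authors' code.

```python
import numpy as np

def predict_answer_count(intensity_fn, n_observed, t_obs, delta_pred_days, n_grid=1000):
    """Predicted count at t_obs + delta_pred_days: the number of answers already
    observed plus the integral of the conditional intensity over the window."""
    ts = np.linspace(t_obs, t_obs + delta_pred_days, n_grid)
    lam = np.array([intensity_fn(t) for t in ts])
    # trapezoidal-rule estimate of E[number of new answers in the window]
    expected_new = float(np.sum(0.5 * (lam[1:] + lam[:-1]) * np.diff(ts)))
    return n_observed + expected_new

# Example with a toy decaying intensity: 40 answers seen by day 5, predict 10 days ahead
print(predict_answer_count(lambda t: 8.0 * np.exp(-0.2 * t), n_observed=40,
                           t_obs=5.0, delta_pred_days=10.0))
```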
From Figure 8 and Figure 9, we can see that the neural network methods RMTPP, CYAN-RNN and CYAN-RNN(cov), which did well in next-answer time prediction, perform very poorly on this task. One important reason is that these models are trained to maximize the likelihood of the next event rather than the number of answers in the future; the point processes they learn are tuned to predicting the next point in time and lose the ability to predict the number of future events. Surprisingly, although our model is also trained with the maximum likelihood of the next event, it exhibits competitive performance compared with the traditional methods on this task, which reflects the strong adjustment ability of our posterior mechanism. In most cases, the predictions of our model are the most accurate, and the longer the observation time, the greater its advantage. This is what we expect: the longer the observation time, the more evidence the Bayesian update can exploit. Compared with the second-best model, the Hawkes process, our best-performing model, PARPP-CON(Top15%), obtains up to 15.8% (18.9%) improvement on the mean absolute error per day and up to 20.8% (14.6%) improvement on $\mathrm{MAPE}_N$ in the case of $\Delta_{pred}$ = 1 day (10 days). These results demonstrate that our proposed model achieves optimal results in predicting answer times while maintaining competitive results in predicting answer counts.

5.3. Evaluation of the Attention Mechanism

RQ3: We examine whether our proposed posterior attention mechanism and convergence strategy can capture high-quality answers. Ideally, an answer's quality can be judged by the number of votes it receives: intuitively, the more votes an answer has, the higher its ranking and the greater its impact on the question. However, since Zhihu does not provide a timestamp for each vote, we cannot track the dynamic change of the vote counts, so we use the number of comments on the answers instead to track their popularity. Whenever a new answer appears, we sort all the observed answers by their cumulative number of comments at this time point and then select the top-k answers. We consider these top-k answers as the ones that have the biggest impact on the question at this moment and treat them as the ground truth. Then, we compare the top-k answers under the model's attention distribution with the ground truth and count the overlap, as sketched below. We use P@k (precision at k) and NDCG@k (normalized discounted cumulative gain at k) as evaluation metrics.
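A minimal sketch of this alignment check is given below, assuming the model exposes one attention weight per observed answer and that cumulative comment counts are available at the same time point; the variable and function names are our own illustration.

```python
def alignment_topk(attention_weights, comment_counts, k=10):
    """Compare the k answers with the highest attention (model side) against the
    k answers with the most accumulated comments (ground truth) and return P@k."""
    pred_topk = sorted(range(len(attention_weights)),
                       key=lambda i: -attention_weights[i])[:k]
    truth_topk = sorted(range(len(comment_counts)),
                        key=lambda i: -comment_counts[i])[:k]
    return len(set(pred_topk) & set(truth_topk)) / k

# Example: 12 observed answers with attention weights and comment counts
attn = [0.02, 0.30, 0.01, 0.15, 0.05, 0.10, 0.03, 0.12, 0.08, 0.04, 0.06, 0.04]
comments = [1, 85, 0, 40, 3, 12, 2, 33, 9, 50, 7, 4]
print(alignment_topk(attn, comments, k=5))  # 4 of the top-5 overlap -> 0.8
```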
Table 5 shows the results. From the table, we can see that our model aligns with the popularity of the answers more accurately than CYAN-RNN, which also uses an attention mechanism. We also compare the alignment performance of prior and posterior attention. Posterior attention is generally better aligned than prior attention, which is in line with our expectations. We can further see that the models that perform well in the two prediction tasks, PARPP-CON(Top10%) and PARPP-CON(Top15%), also align well. This confirms our hypothesis that popular answers have a greater impact on the question and also confirms the effectiveness of the convergence strategy.

6. Conclusions

In this paper, we studied the problem of predicting the popularity of questions in Community Question Answering (CQA): given a question and its historical answers, predict the time of the next arriving answer and the number of answers at a particular point in time. To tackle this problem, we proposed a Posterior Attention Recurrent Point Process (PARPP) model that takes advantage of RNNs' ability to model cascades conveniently and effectively, and we added a posterior attention mechanism to better capture the impact of each answer on the question. In addition, due to the interaction of users in CQA, the ranking of answers exhibits an obvious Matthew effect: early, high-quality answers more easily obtain a good ranking, which in turn makes them more visible to users and lets them accumulate further advantages. To capture the impact of the Matthew effect on question popularity, we further proposed a convergence strategy. We evaluated the effectiveness of the proposed models on a real dataset crawled from the Zhihu platform. The experimental results demonstrate that our proposed models consistently outperform state-of-the-art methods when predicting the time of the next answer and yield competitive performance when predicting the number of answers in the future. Additionally, our revised model, PARPP-CON, performs consistently better than PARPP, implying that the convergence strategy can effectively capture the impact of the Matthew effect on the question.
Limitations of this study include the following: (1) We only take into account the textual information of the answers for popularity prediction. Multimodal information that is naturally embedded in the answers, such as images and the number of thumbs-up, should be taken into account as well. (2) We format the questions and their answers as sequential data and assume that questions are independent of each other. Modeling the questions and their answers as a social network would be more realistic. (3) Recent advances in deep learning such as large language models are not taken into account for improving question popularity prediction, and how to integrate large language models into our prediction task remains unexplored in this paper.
As future work, besides the limitations listed above, there are still a number of unexplored avenues worth investigating. Is our proposed model still effective at predicting popularity on platforms other than Zhihu? We will conduct experiments on more CQA datasets to test whether the proposed model still outperforms the state-of-the-art models. In addition, the proposed model simply encodes text information with BERT [82]; how to effectively extract rich text information in CQA and feed it into the proposed model requires more exploration. How to take social network information into consideration is another direction worth exploring. We plan to take GPT-based models as baselines in our experiments for comparison. We also plan to leverage external resources, such as news articles published on platforms like BBC News and Toutiao, to boost question popularity prediction performance. Finally, we intend to construct a simulated dataset and conduct experiments on it to further explore the performance of our proposed model.

Author Contributions

Experiments, Z.W. and Y.W.; supervision, S.L.; writing—original draft, Z.W., Y.W. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is a phased result of the general project of the Humanities and Social Sciences Research of the Ministry of Education “Research on the Evaluation Mechanism of Judicial Reform Effectiveness in the Era of Big Data” (Grant No. 19YJC820058). This work is partly supported by the National Natural Science Foundation of China (Grant No. 61906219). All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Roy, P.K.; Saumya, S.; Singh, J.P.; Banerjee, S.; Gutub, A. Analysis of community question-answering issues via machine learning and deep learning: State-of-the-art review. CAAI Trans. Intell. Technol. 2023, 8, 95–117. [Google Scholar] [CrossRef]
  2. Kandpal, N.; Deng, H.; Roberts, A.; Wallace, E.; Raffel, C. Large language models struggle to learn long-tail knowledge. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 15696–15707. [Google Scholar]
  3. Wang, Y.; Lipka, N.; Rossi, R.A.; Siu, A.; Zhang, R.; Derr, T. Knowledge graph prompting for multi-document question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–28 February 2024; Volume 38, pp. 19206–19214. [Google Scholar]
  4. Wu, Y.; Fu, Y.; Xu, J.; Yin, H.; Zhou, Q.; Liu, D. Heterogeneous question answering community detection based on graph neural network. Inf. Sci. 2023, 621, 652–671. [Google Scholar] [CrossRef]
  5. Annamoradnejad, I.; Habibi, J. Automatic Moderation of User-Generated Content. In Encyclopedia of Data Science and Machine Learning; IGI Global: Hershey, PA, USA, 2023; pp. 1344–1355. [Google Scholar]
  6. Liu, T.; Zhang, W.N.; Cao, L.; Zhang, Y. Question popularity analysis and prediction in community question answering services. PLoS ONE 2014, 9, e85236. [Google Scholar] [CrossRef] [PubMed]
  7. Wu, Y.; Yin, H.; Zhou, Q.; Liu, D.; Wei, D.; Dong, J. Multi-hop community question answering based on multi-aspect heterogeneous graph. Inf. Process. Manag. 2024, 61, 103543. [Google Scholar] [CrossRef]
  8. Wang, Y.; Shen, H.; Liu, S.; Gao, J.; Cheng, X. Cascade Dynamics Modeling with Attention-based Recurrent Neural Network. In Proceedings of the IJCAI, Melbourne, Australia, 19–25 August 2017; pp. 2985–2991. [Google Scholar]
  9. Embrechts, P.; Liniger, T.; Lin, L. Multivariate Hawkes processes: An application to financial data. J. Appl. Probab. 2011, 48, 367–378. [Google Scholar] [CrossRef]
  10. Rizoiu, M.A.; Xie, L.; Sanner, S.; Cebrian, M.; Yu, H.; Van Hentenryck, P. Expecting to be hip: Hawkes intensity processes for social media popularity. In Proceedings of the 26th International World Wide Web Conferences Steering Committee, Perth, Australia, 3–7 May 2017; pp. 735–744. [Google Scholar]
  11. Merton, R.K. The Matthew effect in science: The reward and communication systems of science are considered. Science 1968, 159, 56–63. [Google Scholar] [CrossRef] [PubMed]
  12. Liao, C.H. The Matthew effect and the halo effect in research funding. J. Inf. 2021, 15, 101108. [Google Scholar] [CrossRef]
  13. Liu, M.; Li, S.; Jin, L. Modeling and analysis of Matthew effect under switching social networks via distributed competition. IEEE/CAA J. Autom. Sin. 2022, 9, 1311–1314. [Google Scholar] [CrossRef]
  14. Saez, E.; Zucman, G. Exploding Wealth Inequality in the United States; Washington Center for Equitable Growth: Washington, DC, USA, 2014. [Google Scholar]
  15. Bol, T.; de Vaan, M.; van de Rijt, A. The Matthew effect in science funding. Proc. Natl. Acad. Sci. USA 2018, 115, 4887–4890. [Google Scholar] [CrossRef]
  16. Pavolini, E.; Van Lancker, W. The Matthew effect in childcare use: A matter of policies or preferences? J. Eur. Public Policy 2018, 25, 878–893. [Google Scholar] [CrossRef]
  17. Wan, Y. The Matthew effect in social commerce. Electron. Mark. 2015, 25, 313–324. [Google Scholar] [CrossRef]
  18. Tang, T.L.P. The Matthew Effect in monetary wisdom. In Monetary Wisdom; Academic Press: Cambridge, MA, USA, 2024; pp. 387–406. [Google Scholar] [CrossRef]
  19. Crawford, M.B. The World beyond Your Head: On Becoming an Individual in an Age of Distraction; Farrar, Straus and Giroux: New York, NY, USA, 2015. [Google Scholar]
  20. Du, N.; Dai, H.; Trivedi, R.; Upadhyay, U.; Gomez-Rodriguez, M.; Song, L. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1555–1564. [Google Scholar]
  21. Saha, A.; Samanta, B.; Ganguly, N.; De, A. Crpp: Competing recurrent point process for modeling visibility dynamics in information diffusion. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 22–26 October 2018; pp. 537–546. [Google Scholar]
  22. Zuo, S.; Liu, T.; Zhao, T.; Zha, H. Differentially Private Estimation of Hawkes Process. arXiv 2022, arXiv:2209.07303. [Google Scholar]
  23. Hawkes, A.G. Spectra of some self-exciting and mutually exciting point processes. Biometrika 1971, 58, 83–90. [Google Scholar] [CrossRef]
  24. Liniger, T.J. Multivariate Hawkes Processes. Ph.D. Thesis, ETH Zurich, Zürich, Switzerland, 2009. [Google Scholar]
  25. Ananthanarayana, T.; Srivastava, P.; Chintha, A.; Santha, A.; Landy, B.; Panaro, J.; Webster, A.; Kotecha, N.; Sah, S.; Sarchet, T.; et al. Deep learning methods for sign language translation. ACM Trans. Access. Comput. (TACCESS) 2021, 14, 22. [Google Scholar] [CrossRef]
  26. Escolano, C.; Costa-jussà, M.R.; Fonollosa, J.A. Multilingual machine translation: Deep analysis of language-specific encoder-decoders. J. Artif. Intell. Res. 2022, 73, 1535–1552. [Google Scholar] [CrossRef]
  27. Zhou, G.; Liu, Y.; Liu, F.; Zeng, D.; Zhao, J. Improving question retrieval in community question answering using world knowledge. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013. [Google Scholar]
  28. Zhou, G.; He, T.; Zhao, J.; Hu, P. Learning continuous word embedding with metadata for question retrieval in community question answering. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; pp. 250–259. [Google Scholar]
  29. Qiu, X.; Huang, X. Convolutional neural tensor network architecture for community-based question answering. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
  30. Wu, W.; Xu, S.; Houfeng, W. Question condensing networks for answer selection in community question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 1746–1755. [Google Scholar]
  31. Zhou, X.; Hu, B.; Chen, Q.; Wang, X. Recurrent convolutional neural network for answer selection in community question answering. Neurocomputing 2018, 274, 8–18. [Google Scholar] [CrossRef]
  32. Li, B.; Jin, T.; Lyu, M.R.; King, I.; Mak, B. Analyzing and predicting question quality in community question answering services. In Proceedings of the 21st International Conference on World Wide Web, Lyon, France, 16–20 April 2012; pp. 775–782. [Google Scholar]
  33. Baltadzhieva, A.; Chrupała, G. Question quality in community question answering forums: A survey. ACM Sigkdd Explor. Newsl. 2015, 17, 8–13. [Google Scholar] [CrossRef]
  34. Zhao, Z.; Yang, Q.; Cai, D.; He, X.; Zhuang, Y. Expert Finding for Community-Based Question Answering via Ranking Metric Network Learning. In Proceedings of the IJCAI, New York, NY, USA, 9–16 July 2016; Volume 16, pp. 3000–3006. [Google Scholar]
  35. Geerthik, S.; Gandhi, K.R.; Venkatraman, S. Domain expert ranking for finding domain authoritative users on community question answering sites. In Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), Chennai, India, 15–17 December 2016; pp. 1–5. [Google Scholar]
  36. Liska, A.; Kocisky, T.; Gribovskaya, E.; Terzi, T.; Sezener, E.; Agrawal, D.; Cyprien De Masson, D.; Scholtes, T.; Zaheer, M.; Young, S.; et al. Streamingqa: A benchmark for adaptation to new knowledge over time in question answering models. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 25–27 July 2022; pp. 13604–13622. [Google Scholar]
  37. Zhuang, Y.; Yu, Y.; Wang, K.; Sun, H.; Zhang, C. Toolqa: A dataset for llm question answering with external tools. Adv. Neural Inf. Process. Syst. 2024, 36, 50117–50143. [Google Scholar]
  38. Quan, X.; Lu, Y.; Liu, W. Towards modeling question popularity in community question answering. In Proceedings of the 2012 IEEE 11th International Conference on Cognitive Informatics and Cognitive Computing, Kyoto, Japan, 22–24 August 2012; pp. 109–114. [Google Scholar]
  39. Guo, N.; Liu, C.; Li, C.; Zeng, Q.; Ouyang, C.; Liu, Q.; Lu, X. Explainable and Effective Process Remaining Time Prediction Using Feature-Informed Cascade Prediction Model. IEEE Trans. Serv. Comput. 2024, 17, 949–962. [Google Scholar] [CrossRef]
  40. Cheng, J.; Adamic, L.; Dow, P.A.; Kleinberg, J.M.; Leskovec, J. Can cascades be predicted? In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Republic of Korea, 7–11 April 2014; pp. 925–936. [Google Scholar]
  41. Gupta, S.; Kambli, R.; Wagh, S.; Kazi, F. Support-vector-machine-based proactive cascade prediction in smart grid using probabilistic framework. IEEE Trans. Ind. Electron. 2014, 62, 2478–2486. [Google Scholar] [CrossRef]
  42. Yang, M.; Li, Z.; Zhou, M.; Liu, J.; King, I. Hicf: Hyperbolic informative collaborative filtering. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 2212–2221. [Google Scholar]
  43. Zaman, T.; Fox, E.B.; Bradlow, E.T. A bayesian approach for predicting the popularity of tweets. Ann. Appl. Stat. 2014, 8, 1583–1611. [Google Scholar] [CrossRef]
  44. Wu, B.; Mei, T.; Cheng, W.H.; Zhang, Y. Unfolding temporal dynamics: Predicting social media popularity using multi-scale temporal decomposition. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
  45. Zhong, C.; Xiong, F.; Pan, S.; Wang, L.; Xiong, X. Hierarchical attention neural network for information cascade prediction. Inf. Sci. 2023, 622, 1109–1127. [Google Scholar] [CrossRef]
  46. Jitkrittum, W.; Gupta, N.; Menon, A.K.; Narasimhan, H.; Rawat, A.; Kumar, S. When Does Confidence-Based Cascade Deferral Suffice? Adv. Neural Inf. Process. Syst. 2024, 36, 9891–9906. [Google Scholar]
  47. Bandari, R.; Asur, S.; Huberman, B.A. The pulse of news in social media: Forecasting popularity. In Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media, Dublin, Ireland, 4–7 June 2012. [Google Scholar]
  48. Cox, D.R.; Isham, V. Point Processes; CRC Press: Boca Raton, FL, USA, 1980; Volume 12. [Google Scholar]
  49. Ogata, Y. Seismicity analysis through point-process modeling: A review. In Seismicity Patterns, Their Statistical Significance and Physical Meaning; Springer: Berlin/Heidelberg, Germany, 1999; pp. 471–507. [Google Scholar]
  50. Bray, A.; Schoenberg, F.P. Assessment of point process models for earthquake forecasting. Stat. Sci. 2013, 510–520. [Google Scholar] [CrossRef]
  51. Pratiwi, H.; Rini, L.S.; Mangku, I.W. Marked point process for modelling seismic activity (case study in Sumatra and Java). J. Phys. Conf. Ser. 2018, 1022, 012004. [Google Scholar] [CrossRef]
  52. Soriano-Redondo, A.; Jones-Todd, C.M.; Bearhop, S.; Hilton, G.M.; Lock, L.; Stanbury, A.; Votier, S.C.; Illian, J.B. Understanding species distribution in dynamic populations: A new approach using spatio-temporal point process models. Ecography 2019, 42, 1092–1102. [Google Scholar] [CrossRef]
  53. Bauwens, L.; Hautsch, N. Modelling financial high frequency data using point processes. In Handbook of Financial Time Series; Springer: Berlin/Heidelberg, Germany, 2009; pp. 953–979. [Google Scholar]
  54. Filimonov, V.; Sornette, D. Apparent criticality and calibration issues in the Hawkes self-excited point process model: Application to high-frequency financial data. Quant. Financ. 2015, 15, 1293–1314. [Google Scholar] [CrossRef]
  55. Zhao, Q.; Erdogdu, M.A.; He, H.Y.; Rajaraman, A.; Leskovec, J. Seismic: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 10–13 August 2015; pp. 1513–1522. [Google Scholar]
  56. Kobayashi, R.; Lambiotte, R. Tideh: Time-dependent hawkes process for predicting retweet dynamics. In Proceedings of the Tenth International AAAI Conference on Web and Social Media, Cologne, Germany, 17–20 May 2016. [Google Scholar]
  57. Kingman, J.F.C. Poisson Processes. In Encyclopedia of Biostatistics; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2005. [Google Scholar] [CrossRef]
  58. Isham, V.; Westcott, M. A self-correcting point process. Stoch. Process. Their Appl. 1979, 8, 335–347. [Google Scholar] [CrossRef]
  59. Dheur, V.; Bosser, T.; Izbicki, R.; Taieb, S.B. Distribution-Free Conformal Joint Prediction Regions for Neural Marked Temporal Point Processes. arXiv 2024, arXiv:2401.04612. [Google Scholar] [CrossRef]
  60. Borrajo, M.; González-Manteiga, W.; Martínez-Miranda, M. Goodness-of-fit test for point processes first-order intensity. Comput. Stat. Data Anal. 2024, 194, 107929. [Google Scholar] [CrossRef]
  61. Kındap, Y.; Godsill, S. Point process simulation of generalised hyperbolic Lévy processes. Stat. Comput. 2024, 34, 33. [Google Scholar] [CrossRef]
  62. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  63. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  64. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  65. Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. arXiv 2015, arXiv:1508.04025. [Google Scholar]
  66. Shen, X.; Yuan, S.; Sheng, H.; Du, H.; Yu, X. Auslan-Daily: Australian Sign Language translation for daily communication and news. Adv. Neural Inf. Process. Syst. 2024, 36, 80455–80469. [Google Scholar]
  67. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
  68. Yu, L.; Buys, J.; Blunsom, P. Online segment to segment neural transduction. arXiv 2016, arXiv:1609.08194. [Google Scholar]
  69. Aharoni, R.; Goldberg, Y. Morphological inflection generation with hard monotonic attention. arXiv 2016, arXiv:1611.01487. [Google Scholar]
  70. Yang, Z.; Hu, Z.; Deng, Y.; Dyer, C.; Smola, A. Neural machine translation with recurrent attention modeling. arXiv 2016, arXiv:1607.05108. [Google Scholar]
  71. Deng, Y.; Kim, Y.; Chiu, J.; Guo, D.; Rush, A. Latent alignment and variational attention. Adv. Neural Inf. Process. Syst. 2018, 31, 9712–9724. [Google Scholar]
  72. Wu, S.; Shapiro, P.; Cotterell, R. Hard Non-Monotonic Attention for Character-Level Transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4425–4438. [Google Scholar]
  73. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 31, 5998–6008. [Google Scholar]
  74. Fazil, M.; Sah, A.K.; Abulaish, M. Deepsbd: A deep neural network model with attention mechanism for socialbot detection. IEEE Trans. Inf. Forensics Secur. 2021, 16, 4211–4223. [Google Scholar] [CrossRef]
  75. Mazzia, V.; Angarano, S.; Salvetti, F.; Angelini, F.; Chiaberge, M. Action Transformer: A self-attention model for short-time pose-based human action recognition. Pattern Recognit. 2022, 124, 108487. [Google Scholar] [CrossRef]
  76. Shaik, T.; Tao, X.; Li, L.; Xie, H.; Cai, T.; Zhu, X.; Li, Q. Framu: Attention-based machine unlearning using federated reinforcement learning. IEEE Trans. Knowl. Data Eng. 2024. [Google Scholar] [CrossRef]
  77. Shankar, S.; Sarawagi, S. Posterior Attention Models for Sequence to Sequence Learning. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  78. Kempe, C.; Eriksson-Gustavsson, A.L.; Samuelsson, S. Are there any Matthew effects in literacy and cognitive development? Scand. J. Educ. Res. 2011, 55, 181–196. [Google Scholar] [CrossRef]
  79. Gómez-Bengoechea, G.; Jung, J. The Matthew effect: Evidence on firms’ digitalization distributional effects. Technol. Soc. 2024, 76, 102423. [Google Scholar] [CrossRef]
  80. Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. Numerical Recipes 3rd Edition: The Art of Scientific Computing; Cambridge University Press: Cambridge, UK, 2007. [Google Scholar]
  81. Xiao, H. Bert-As-Service. 2018. Available online: https://github.com/hanxiao/bert-as-service (accessed on 20 May 2021).
  82. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  83. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  84. Prechelt, L. Automatic early stopping using cross validation: Quantifying the criteria. Neural Netw. 1998, 11, 761–767. [Google Scholar] [CrossRef]
  85. Morales-Navarrete, D.; Bevilacqua, M.; Caamaño-Carrillo, C.; Castro, L.M. Modeling point referenced spatial count data: A Poisson process approach. J. Am. Stat. Assoc. 2024, 119, 664–677. [Google Scholar] [CrossRef]
  86. Kumar, P. Deep Hawkes process for high-frequency market making. J. Bank. Financ. Technol. 2024, 1–18. [Google Scholar] [CrossRef]
  87. Croft, W.B.; Metzler, D.; Strohman, T. Search Engines: Information Retrieval in Practice; Addison-Wesley Reading: Boston, MA, USA, 2010; Volume 520. [Google Scholar]
  88. Järvelin, K.; Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. (TOIS) 2002, 20, 422–446. [Google Scholar] [CrossRef]
Figure 1. (a) An example screenshot, including a question and its two top-ranked answers, taken from the Zhihu app, which is a famous Chinese Community Question Answering forum. (b) The translation of (a). As we can see, the number of answers to this question is nearly 9000. Because the ranking of answers in CQA is mainly based on user feedback, the top answer is not the latest one, but the one with the most upvotes.
Figure 2. The framework of the proposed PARPP. The figure presents the case when predicting the time when the ( k + 1 ) -th answer appears. The sequence at the bottom is the observed cascade and the sequence at the top is the predictive answering behaviors. h k represents the encoding of the k-th answer and s k represents the decoding state. Unlike the traditional attention mechanism, we use attention after decoding and update the decoding status with posterior attention after observing the next answer.
Figure 3. Graphical illustration of two kinds of attention mechanisms. (a) The attention mechanism applied in PARPP. On the top is the distribution of attention when k answers are observed. Below is the distribution of attention when a new answer appears. Since the new answer is not easily noticed, people’s attention distribution should be similar to that at the t k -moment with very little attention allocated to the new answers. (b) The attention mechanism with convergence applied in PARPP-CON. Because of the Matthew effect in CQA, the top-ranked answers will accumulate their own advantages. Over time, an extreme situation occurs in which people are attracted by only the first few answers and completely ignore the others, as shown in the left chart. Therefore, we believe that when a new answer appears, attention will somehow converge to the top-ranked answers.
Figure 4. Example data. The first and the second rows contain the original data, while the third and the fourth contain the corresponding English translation. The first row refers to the question, while the second row refers to the answers with timestamps.
Figure 5. (a) Empirical time interval distribution histogram. The x-axis is in log-scale and the unit is hours. The mean value is 25.15 h. (b) Histogram of answer arrival times measured from the time the question was posted. The unit is days. Only answers that appear within 60 days are displayed here. (c) Convergence of the mean and median cumulative answer count $N_t$ as a function of time. The horizontal dotted lines correspond to the mean and median final answer counts over 180 days. On average, a question receives nearly 80% of its answers in the first 90 days.
Figure 6. Distribution histogram of the number of upvotes obtained by answers. (a) shows the statistics of the top-k answers and (b) shows the statistics of the top-k% answers.
Figure 7. Performance of predicting the time of the next answer. The x-axis is the observation time, and a time prediction is made every six hours. (a) $\mathrm{MAPE}_t$ for Zhihu; (b) magnified view of (a).
Figure 8. Dependence of the popularity prediction error on observation time. The window sizes selected for prediction, $\Delta_{pred}$, are 1 day and 10 days. (a) popularity prediction performance ($\Delta_{pred}$ = 1 day); (b) magnified view of (a); (c) popularity prediction performance ($\Delta_{pred}$ = 10 days); (d) magnified view of (c).
Figure 9. Dependence of the popularity prediction error on observation time. The window sizes selected for prediction, $\Delta_{pred}$, are 1 day and 10 days. (a) popularity prediction performance ($\Delta_{pred}$ = 1 day); (b) magnified view of (a); (c) popularity prediction performance ($\Delta_{pred}$ = 10 days); (d) magnified view of (c).
Table 1. Main notations used across the whole paper.
Symbol | Description
C | a collection of cascades
S | a question-answer cascade
M | total number of question-answer cascades
i | sequence number of a question-answer cascade
k | index of the k-th answer
$t_k$ | the time when the k-th answer is posted
$\tilde{t}_k$ | model's estimate of the time of the k-th answer
$m_k$ | the marker of the k-th answer
$H_k$ | observed history up to the k-th answer
$N_i$ | total number of answers in question-answer cascade i
$N_{t_k}$ | the number of answers up to time $t_k$
$\tilde{N}_{t_k}$ | model's estimate of the number of answers at time $t_k$
$\lambda$ | conditional intensity function
$f(t|H_k)$ | conditional probability density function
$F(t|H_k)$ | conditional cumulative distribution function
L | the negative log-likelihood function
Table 2. Statistics of the Zhihu dataset used in our experiments.
Topic | #Questions | #Answers | #Comments | #Votes | Average Duration (Days)
Lifestyle | 529 | 133,782 | 582,868 | 7,372,546 | 1165
Economics | 751 | 71,871 | 278,613 | 2,809,376 | 1319
Sports | 539 | 58,885 | 387,740 | 3,856,046 | 1196
Internet | 209 | 37,985 | 132,640 | 2,950,844 | 698
Art | 544 | 60,455 | 270,134 | 3,559,153 | 1249
Read | 432 | 111,631 | 292,375 | 4,336,772 | 1177
Food | 194 | 62,467 | 342,557 | 3,834,906 | 1006
Animation | 475 | 48,005 | 152,896 | 1,870,583 | 410
Movie | 254 | 47,006 | 136,307 | 2,296,818 | 472
Game | 132 | 51,471 | 206,775 | 2,604,300 | 910
Total | 8272 | 1,315,316 | 5,910,928 | 74,061,774 | 979
Table 3. Concentration of total votes among top and early answers (%).
Topic | Nums | Top 10% | Top 20% | Earliest 10% | Earliest 20%
Lifestyle | 529 | 86.0 | 92.8 | 26.7 | 40.8
Economics | 751 | 79.8 | 88.3 | 33.3 | 45.2
Sports | 539 | 82.5 | 90.3 | 26.8 | 37.9
Internet | 209 | 82.5 | 90.9 | 30.2 | 44.2
Art | 544 | 77.1 | 86.5 | 26.5 | 40.2
Read | 432 | 83.1 | 90.6 | 29.5 | 41.4
Food | 194 | 90.5 | 95.6 | 28.0 | 42.9
Animation | 475 | 77.9 | 86.7 | 30.6 | 45.0
Movie | 254 | 82.8 | 90.8 | 33.9 | 48.5
Game | 132 | 79.2 | 88.9 | 30.6 | 47.0
Mean | - | 82.1 | 90.1 | 29.6 | 43.3
Table 4. Performance of predicting the time of the next answer on the Zhihu dataset.
Model | RMSE (Day) | $\mathrm{MAPE}_t$ (%)
Naive | 4.90 | 39.82
Poisson | 5.02 | 118.21
Hawkes | 4.83 | 398.11
RMTPP | 4.66 | 23.44
CYAN-RNN | 4.60 | 23.49
CYAN-RNN(cov) | 4.58 | 23.34
PARPP | 4.21 | 20.87
PARPP-CON(Top1%) | 4.17 | 16.92
PARPP-CON(Top5%) | 4.17 | 16.20
PARPP-CON(Top10%) | 4.14 | 16.01
PARPP-CON(Top15%) | 4.11 | 15.93
PARPP-CON(Top20%) | 4.16 | 15.99
Table 5. Performance of the attention mechanism in aligning with the top-k most popular answers.
Model | P@10 | P@20 | NDCG@10 | NDCG@20
CYAN-RNN | 0.122 | 0.231 | 0.107 | 0.199
CYAN-RNN(cov) | 0.115 | 0.215 | 0.109 | 0.191
PARPP (prior) | 0.135 | 0.241 | 0.114 | 0.206
PARPP-CON(Top1%) (prior) | 0.157 | 0.262 | 0.109 | 0.196
PARPP-CON(Top5%) (prior) | 0.159 | 0.241 | 0.112 | 0.203
PARPP-CON(Top10%) (prior) | 0.159 | 0.252 | 0.113 | 0.206
PARPP-CON(Top15%) (prior) | 0.170 | 0.247 | 0.120 | 0.216
PARPP-CON(Top20%) (prior) | 0.170 | 0.239 | 0.114 | 0.208
PARPP (posterior) | 0.155 | 0.251 | 0.115 | 0.209
PARPP-CON(Top1%) (posterior) | 0.156 | 0.262 | 0.110 | 0.199
PARPP-CON(Top5%) (posterior) | 0.157 | 0.246 | 0.115 | 0.212
PARPP-CON(Top10%) (posterior) | 0.166 | 0.268 | 0.115 | 0.210
PARPP-CON(Top15%) (posterior) | 0.171 | 0.267 | 0.119 | 0.220
PARPP-CON(Top20%) (posterior) | 0.170 | 0.248 | 0.113 | 0.206
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
