**3. Methods**

Our approach carries out semantic and linguistic analyses to reveal the health-related characteristics of patients' online textual questions containing hearing loss-related words. The present study consists of three phases: data collection, topic discovery, and topic extraction. Figure 1 shows the overall procedure of the research analysis.

**Figure 1.** Overall research flow.

### *3.1. Data Collection*

Electronic patient-authored texts on the topic of hearing loss were collected from the social Q&A platform Naver Knowledge-iN. Launched in 2002, Naver Knowledge-iN is the largest social Q&A community platform in South Korea, where online users can post and share questions on topics ranging from insurance policy to medical treatment. Health topics are popular among questioners. To collect research data, we developed a software program to access and gather questions posted on Naver Knowledge-iN from 2009 to 2019. Using the keyword "hearing loss", we collected 68,327 questions. Repeated or duplicated questions were excluded, as were questions containing fewer than 10 words. As a result, our final dataset consisted of 65,842 questions, which were analyzed in this study.
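The two exclusion rules above can be sketched as a small filtering routine. This is an illustrative reconstruction, not the authors' actual software; the function name and interface are assumptions.

```python
def filter_questions(questions, min_words=10):
    """Drop duplicated posts and posts with fewer than `min_words` words."""
    seen = set()
    kept = []
    for q in questions:
        text = q.strip()
        if text in seen:                        # exclude repeated/duplicated questions
            continue
        seen.add(text)
        if len(text.split()) < min_words:       # exclude questions under 10 words
            continue
        kept.append(text)
    return kept
```

Applied to the raw crawl of 68,327 questions, a routine of this shape would yield the final analysis set.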

### *3.2. Topic Discovery via Latent Dirichlet Allocation (LDA)*

To discover the topics in the collected textual questions, we utilized a topic modeling approach that clusters words semantically associated with "hearing loss" into subtopics. Topic modeling has been widely applied in the health and medical domains, for example, to extract relevant clinical concepts from patient health records [31], discover health topics in social media [11,39,40], identify emerging patterns of clinical events [41], and detect new disease outbreaks [42]. Among the diverse topic modeling techniques, Latent Dirichlet Allocation (LDA) [43] has gained popularity as a tool for automatic text summarization and visualization. In this study, we apply the LDA model to extract topics from the collected corpus.

Automatic text analysis methods are usually divided into supervised and unsupervised approaches. Unsupervised methods do not classify the text content in advance; instead, they reduce the dimensionality of the text through statistical probabilistic inference and explain the corpus as a whole through the reduced set of themes. The LDA model is an unsupervised machine learning method that uses a bag-of-words representation. It introduces a latent variable, *topic*, between the observed variables *document* and *word* to explain the semantic topic distribution of documents. LDA represents each document as a random mixture over latent topics, where each topic is characterized by a probability distribution over words.
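A minimal sketch of this pipeline, fitting an LDA model to a bag-of-words matrix, is shown below. The paper does not specify its implementation; scikit-learn, the toy corpus, and the parameter values here are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for the question corpus (not real study data).
docs = [
    "hearing loss ear ringing tinnitus",
    "hearing aid device battery cost",
    "sudden hearing loss treatment steroid",
    "hearing aid fitting clinic appointment",
]

# Bag-of-words: each document becomes a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit LDA with k = 2 latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # each row: that document's topic mixture
```

Each row of `doc_topic` is a probability distribution over the k topics, and `lda.components_` gives the per-topic word weights used to interpret each topic.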

LDA is a generative probabilistic model in which the sets of words observed in documents are explained by unobserved topic groups, accounting for why some parts of the data are similar. Each document consists of a small number of different topics, and the generation of each word is attributable to one of the document's topics. The plate diagram of the LDA model, shown in Figure 2, helps to explain the components of the model.
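The generative story just described can be simulated directly: draw a per-document topic mixture θ from a Dirichlet prior, then for each word draw a topic z and a word w from that topic's word distribution. The vocabulary, hyperparameters, and topic-word probabilities below are toy values, not estimates from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2
vocab = ["hearing", "loss", "aid", "clinic"]
alpha = np.full(k, 0.5)                    # Dirichlet prior over topic mixtures
beta = np.array([[0.60, 0.30, 0.05, 0.05], # topic 0: distribution over words
                 [0.05, 0.05, 0.50, 0.40]])# topic 1: distribution over words

theta = rng.dirichlet(alpha)               # this document's topic mixture
doc = []
for _ in range(8):                         # generate an 8-word document
    z = rng.choice(k, p=theta)             # choose a topic for this word
    w = rng.choice(len(vocab), p=beta[z])  # choose a word from that topic
    doc.append(vocab[w])
```

Inference (Equation (1) below) runs this process in reverse: given only the observed words, it recovers the posterior over θ and the word-level topic assignments z.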

**Figure 2.** Plate model representation of Latent Dirichlet Allocation (LDA).

LDA assumes that documents and the words within them are derived from a generative probabilistic model [43]. Here, *k* is the number of topics, *M* is the number of documents, and *N* is the number of words within a document. Given a corpus *D* consisting of *M* documents, where document *d* contains *N<sub>d</sub>* words (*d* ∈ 1, ..., *M*), LDA models *D* according to the following posterior distribution:


$$P(\theta, z \mid w, \alpha, \beta) = \frac{P(\theta, z, w \mid \alpha, \beta)}{P(w \mid \alpha, \beta)}\tag{1}$$
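The numerator of Equation (1) is the model's joint distribution, which for a single document of *N* words factorizes according to the generative process [43]:

```latex
P(\theta, z, w \mid \alpha, \beta)
  = P(\theta \mid \alpha) \prod_{n=1}^{N} P(z_n \mid \theta)\, P(w_n \mid z_n, \beta)
```

Because the normalizing term *P*(*w* | α, β) requires marginalizing over θ and *z*, the posterior is intractable to compute exactly and is estimated with approximate inference.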
