A New Sentence-Based Interpretative Topic Modeling and Automatic Topic Labeling
Abstract
1. Introduction
2. Description of the Developed Topic Model
- input: each sentence of the collection, represented as a sequence of symbols, and a pre-trained BERT model.
- output: each sentence of the collection, represented as an embedding.
Algorithm 1. The pseudocode of the developed algorithm.
Input: Texts (text collection); k (required number of topics); pre-trained BERT model; d (maximum distance to centroid).
- input—sentence embeddings of a text collection.
- output—a subset of sentence embeddings of a text collection.
- input—a subset of sentence embeddings of a text collection; number of clusters.
- output—clusters of sentence embeddings.
- local criterion—sum of squared distances.
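The clustering step with a sum-of-squared-distances criterion can be illustrated with plain k-means (Lloyd's algorithm), which is a local search for that criterion. This is a minimal sketch rather than the authors' implementation: the function names, the random initialization, and the toy data are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns labels, centroids and the sum of squares."""
    rng = random.Random(seed)
    # Assumption: centroids start at k randomly chosen input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so a local minimum is reached
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


# Toy stand-in for sentence embeddings: two well-separated groups.
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels, toy_centroids, toy_sse = kmeans(toy_points, k=2)
```

Because the result depends on the starting centroids, such a search is usually repeated from several seeds and the run with the smallest sum of squares is kept.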
- input—sentence embeddings of a text collection; clusters; clusters’ centroids.
- output—updated clusters.
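One way to read this update step is as a nearest-centroid assignment over the whole collection, with a cutoff on the distance to the centroid. The sketch below follows that reading. The cutoff rule, the function name `update_clusters`, and the choice to leave distant sentences unassigned are assumptions, not details confirmed by the text.

```python
import math


def update_clusters(embeddings, centroids, max_distance):
    """Assign each embedding index to its nearest centroid if it is close enough."""
    clusters = {j: [] for j in range(len(centroids))}
    for i, vec in enumerate(embeddings):
        distances = [math.dist(vec, c) for c in centroids]
        nearest = min(range(len(centroids)), key=distances.__getitem__)
        # Assumption: embeddings farther than max_distance stay unassigned.
        if distances[nearest] <= max_distance:
            clusters[nearest].append(i)
    return clusters


example_clusters = update_clusters(
    embeddings=[[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [100.0, 100.0]],
    centroids=[[0.0, 0.0], [5.0, 5.0]],
    max_distance=1.0,
)
```

In this toy call the last embedding is far from both centroids, so it does not appear in any cluster.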
- input—updated clusters.
- output—matrix F of sentence probabilities.
- n is the number of texts in the collection, and
- m is the number of clusters of sentence embeddings obtained in the previous step.
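One plausible reading of the matrix F, with n rows (texts) and m columns (clusters), is that each row holds the share of a text's sentences falling into each cluster. The sketch below builds such a matrix. The row normalization, the function name `build_f_matrix`, and the input layout (one list of cluster indices per text) are assumptions, since the exact definition is not given here.

```python
def build_f_matrix(text_cluster_ids, m):
    """Return an n x m matrix whose row i is the cluster distribution of text i."""
    matrix = []
    for cluster_ids in text_cluster_ids:
        counts = [0] * m
        for j in cluster_ids:
            counts[j] += 1
        total = sum(counts)
        # A text with no assigned sentences gets an all-zero row.
        matrix.append([c / total if total else 0.0 for c in counts])
    return matrix


# Three toy texts; each inner list holds the cluster index of every sentence.
example_f = build_f_matrix([[0, 0, 1], [2], []], m=3)
```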
- input—F matrix.
- output: estimates of the parameters, namely the probability of a sentence occurring within a topic and the probability of a topic occurring within a text.
- local criterion—maximum likelihood value.
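The estimation step can be illustrated with a PLSA-style EM iteration over F, where columns play the role that words play in classical PLSA. This is a sketch under stated assumptions: the names `phi` (probability of a sentence cluster within a topic) and `theta` (probability of a topic within a text), the random initialization, and the fixed iteration count are illustrative, and the authors' update rules may differ.

```python
import math
import random


def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa_em(F, k, iterations=50, seed=0):
    """Estimate phi (k x m) and theta (n x k) from an n x m matrix F."""
    rng = random.Random(seed)
    n, m = len(F), len(F[0])
    phi = [normalize([rng.random() for _ in range(m)]) for _ in range(k)]
    theta = [normalize([rng.random() for _ in range(k)]) for _ in range(n)]
    for _ in range(iterations):
        phi_acc = [[0.0] * m for _ in range(k)]
        theta_acc = [[0.0] * k for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if F[i][j] == 0:
                    continue
                # E-step: posterior over topics for the pair (text i, cluster j).
                post = normalize([theta[i][t] * phi[t][j] for t in range(k)])
                # Accumulate expected weights for the M-step.
                for t in range(k):
                    phi_acc[t][j] += F[i][j] * post[t]
                    theta_acc[i][t] += F[i][j] * post[t]
        # M-step: renormalize the accumulated weights.
        phi = [normalize(row) for row in phi_acc]
        theta = [normalize(row) for row in theta_acc]
    return phi, theta


def log_likelihood(F, phi, theta):
    """Log-likelihood of F under the mixture defined by phi and theta."""
    total = 0.0
    for i, row in enumerate(F):
        for j, f in enumerate(row):
            if f > 0:
                p = sum(theta[i][t] * phi[t][j] for t in range(len(phi)))
                total += f * math.log(p)
    return total


toy_F = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
toy_phi, toy_theta = plsa_em(toy_F, k=2)
```

Each EM iteration cannot decrease the log-likelihood, which is why the likelihood value serves as a local criterion: the procedure stops at a local maximum that depends on the initialization.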
3. Dataset and Quality Criterion Estimation
4. Experimental Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. BERT
- Addition of a classification layer on top of the encoder output.
- Calculation of the probability of each word in the vocabulary using softmax.
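The two steps above amount to a linear layer followed by a softmax over the vocabulary. The sketch below shows only that arithmetic on a toy three-word vocabulary. The names `word_probabilities`, `weights`, and `bias` and all numeric values are illustrative assumptions, not parameters of any real BERT checkpoint.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_probabilities(hidden, weights, bias):
    """Linear classification layer on an encoder output, then softmax."""
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy encoder output for one masked position (hidden size 2, vocabulary size 3).
toy_probs = word_probabilities(
    hidden=[1.0, 0.0],
    weights=[[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]],
    bias=[0.0, 0.0, 0.0],
)
```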
- A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
- A sentence embedding indicating sentence A or sentence B is added to each token.
- A positional embedding is added to each token to indicate its position in the sequence.
- The entire input sequence goes through the model.
- The output of the [CLS] token is transformed into a 2 × 1 shaped vector, using a simple classification layer.
- Calculation of the probability using softmax.
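The input construction and the final two-way classification described above can be sketched as follows. The helper names, the whitespace-level tokens, and the convention that index 0 means "sentence B follows sentence A" are assumptions for illustration. A real model would look up learned token, segment, and positional embeddings for these ids and sum them before the encoder.

```python
import math


def pack_sentence_pair(tokens_a, tokens_b):
    """Build the token sequence with segment and position ids for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment id 0 marks sentence A (with [CLS] and its [SEP]), 1 marks sentence B.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Position ids simply index the tokens in order.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def next_sentence_probability(cls_output, weights, bias):
    """Two-way classification layer on the [CLS] output, followed by softmax."""
    logits = [
        sum(w * h for w, h in zip(row, cls_output)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    # Assumption: index 0 is the "B follows A" class.
    return exps[0] / sum(exps)


pair_tokens, pair_segments, pair_positions = pack_sentence_pair(
    ["my", "dog"], ["he", "barks"]
)
```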
Appendix B. Minimum Sum of Squares Clustering (MSSC)
Appendix C. General EM-Algorithm
Astra, astrazeneca, pfizer, undervalued, soriot, az, azn, pfe, glaxosmithkline, wyeth.
After making a failed bid for AstraZeneca in January, pharmaceutical heavyweight Pfizer is again pursuing a deal for its British rival that would rank among the largest in industry history.
Business secretary says he is committed to maintaining UK’s position in pharma industry, but investors welcome bid Battle lines are being drawn over what would be the biggest foreign takeover of a British company, after the pharmaceutical firm AstraZeneca rejected a £60bn approach from its US rival Pfizer.
Pfizer’s interest prompted a warning on jobs from Vince Cable but was welcomed by investors, who sent stock in Britain’s largest drug maker up by 14%.
In addition, US drugs giant Pfizer was rumored to have tabled a $100 bn bid for UK-based AstraZeneca, which prompted the FTSE 100-listed company to announce plans to spin off non-core assets.
Yesterday, the UK’s benchmark index closed 0.22 percent higher at 6700.16 points with the advance largely due to AstraZeneca (LON: AZN) whose shares soared after Pfizer confirmed its interest in the FTSE 100 company.
U.S. drug maker Pfizer Inc. approached Britain’s AstraZeneca Plc two days ago to reignite a potential $100 billion takeover and was rebuffed, raising investor expectations it will have to increase its offer to close the deal.
AstraZeneca shares were up 11.7 percent at $76.69 in New York on news of the latest offer, which would be the biggest foreign acquisition of a British company and one of the largest pharmaceutical deals.
Anglo-Swedish drugs giant is spurning the advances of Pfizer after the US drugs giant made a £59 bn takeover offer.
Pfizer Inc. was turned down twice by fellow drug maker AstraZeneca PLC, but the maker of Viagra and Lipitor said Monday that its proposed $100 billion acquisition makes sense for shareholders of both companies, and it’s considering its next steps.
The Pfizer offer led to a transatlantic stampede for pharma shares yesterday, driving a 50 percent spike in the amount of AstraZeneca shares changing hands in London, helping it close 13 per cent.
Bandhan, a microfinance entity, and IDFC are likely to get bank licenses after the Election Commission on Tuesday gave the green signal to RBI to announce new entrants in the sector, ending weeks of uncertainty over the crucial reform measure.
The Reserve Bank of India had sought the commission’s approval to issue new bank licenses to ensure the process would not clash with the code of conduct ahead of elections, which prevents decisions that may be deemed as political from being taken by government officials or regulators.
The RBI and the government had initially set a deadline of issuing new licenses by the end of March.
A senior EC official told TOI that the RBI was competent to take its own decisions and the commission agreed that the ongoing polls need not delay its functions as a banking regulator.
Reserve Bank of India governor Raghuram Rajan left policy rates unchanged and signaled that he was waiting to see whether the post-election Budget would take the path of fiscal prudence before deciding on interest rates.
The repo rate, or the interest that banks pay when they borrow money from the RBI to meet their short-term fund requirements, has been left unchanged at 8 percent.
The Reserve Bank of India (RBI) held interest rates today while shifting its liquidity provision to longer-term repurchase operations (repos) as it continues to transform its monetary policy framework.
Liquidity conditions have tightened in March, partly on account of year-end window dressing by banks, though an extraordinary infusion of liquidity by the Reserve Bank has mitigated the tightness.
About 1200 Parisian drivers were blocking the Charles de Gaulle and Orly airports this morning and preventing private car services from picking up passengers, said Nadine Annet, vice president at the FNAT taxi association in France.
Taxi drivers brought parts of London, Paris and other European cities to a standstill on Wednesday as they protested against new private cab apps such as Uber which have shaken up the industry.
Uber app has caused chaos in London as London’s taxi drivers come out in protest.
The capital ground to halt today as London’s black taxi drivers took to the streets in protest over Uber.
During a 24-h protest in Madrid, cab drivers surrounded a car suspected of being a private taxi.
Transport in major European cities has been disrupted by strikes affecting taxis and rail services.
There was not a taxi to be found on the streets of Madrid on Wednesday morning after the city’s cab drivers began a 24-h strike to protest against online carpooling companies that match individual drivers and passengers.
Taxi drivers sowed traffic chaos in Europe’s top cities on Wednesday by mounting one of the biggest ever protests against Uber, a US car service which allows people to summon rides at the touch of a button.
The protest includes a reported 30,000 cab and limo drivers, from London to Paris to Madrid, who are miffed by the same gripes as their American counterparts, namely, that Uber is swiping their business without abiding by any of their rules.
In Spain, the Barcelona and Madrid cab unions represent nearly all the cabs in those cities, and they too have scheduled a protest, this protest will be 24 h and will be tougher on the citizens than the other protests.
The commuters in London, Paris, Madrid, and Berlin faced tough times today as taxi drivers in these cities decided to block the streets of the city to protest against the ride-sharing service Uber.
Bloomberg reported that over 30,000 taxi and limo drivers participating in the protest drive, creating huge trouble for the commuters as it led to massive traffic jams in tourist centers and shopping districts across Europe.
Seth Rollins grabbed the briefcase after Kane pulled Dean Ambrose off the ladder, just as he was about to win, and hit him with a tombstone, for good measure.
Kingston with a dropkick on Swagger followed by boom drop on him as Swagger was laying on top of a ladder.
Rollins went after Ambrose while everybody was out of the ring, which led to Ambrose hitting a double underhook suplex that sent Rollins into a ladder.
Rollins hit both of them with a ladder, but then RVD hit Rollins with a dropkick.
RVD hit Rolling Thunder on Rollins while he was laying on top of a ladder.
Swagger set up the ladder in the corner of the ring on RVD and wanted a superplex, but RVD fought out.
Swagger sent Ziggler into a ladder that crashed into Ambrose.
Then Swagger slammed a ladder onto Kingston followed by a Swagger Bomb that crushed Kingston.
Kingston gave Rollins a back body drop that sent him crashing into the ladder that was bridged from the ropes.
With Cesaro near the top of the ladder, Orton yanked him off and hit a RKO off the ladder.
U.S. drugmaker Pfizer Inc. approached Britain’s AstraZeneca Plc two days ago to reignite a potential $100 billion takeover and was rebuffed, raising investor expectations it will have to increase its offer to close the deal.
Governor Raghuram Rajan says Reserve Bank of India (RBI) should not be in the business of bailing out banks by infusing cash to make up for year-end distortions and the current policy rate has been appropriately set, the central bank chief said post the policy review on Tuesday.
The commuters in London, Berlin, Paris and Madrid faced a day of traffic chaos on Wednesday as taxi drivers mounted one of the biggest protests against the threat of Uber, a U.S. car service which allows people to summon rides at the touch of a button.
Sheamus goes up top but Reigns nails him with a Superman Punch.
22 Topics | Average Value | Standard Deviation | Max Value |
---|---|---|---|
Criterion 1 | 31.89% | 0.02 | 35.90% |
Criterion 2 | 66.28% | 0.02 | 70.91% |
Cena climbed up and grabbed the WWE & World Title to win the match after 27 min.
Rusev takes the opportunity to crush his foe and locks in The Accolade for the victory.
Orton goes to belt town, but this time Roman Reigns is back to break the climb and the two rivals scrap it out on the ladder.
Cena takes out Kane and Orton, and wins!
Paige jumps in for another flurry but AJ counters with a roll up for the win.
Jey comes right behind him with a second splash for the win.
They make it back in and Paige applies the submission before turning it into a 2 count.
Kane stands guard and holds the ladder as Rollins climbs up to grab the briefcase for the win.
They raise Rollins’ arms and the briefcase in victory.
Kingston knocked Swagger out of the ring with a ladder shot.
Ambrose and Swagger were sent out of the ring.
Ambrose ran back into the ring as Rollins was about to win.
Rusev and Lana entered the ring.
Triple H and Stephanie McMahon made their way down to ringside.
He threw Cesaro and Del Rio out of the ring.
Cena took Del Rio out of the ring.
Orton sent Sheamus out of the ring.
Cena gave Kane the Attitude Adjustment.
Before the health-care law, one insurance company held at least half the individual market in 30 states, according to the Kaiser report.
Only 56 percent polled said they plan to purchase health insurance.
As of February 24, the average premium for an individual health plan selected through eHealth without a subsidy was $274 per month, a 39 percent increase over the average individual premium for pre-Obamacare coverage.
In 2010, when the Affordable Care Act was passed, a single insurer had more than half the individual insurance market in 30 states.
More than 5 million people have enrolled in private health insurance under Obamacare, according to the administration.
Nationally, the number of uninsured people in 2012 was estimated at about 47 million.
Young people are vital to the success of President Barack Obama’s signature healthcare law.
Regarding people with preexisting conditions, for instance, they point to the high risk insurance pools, which would be managed and subsidized by states.
They will have to spend thousands of dollars more before their benefits actually take effect.
The individual penalty increases each year through 2016.
Now, he will pay $22 a month for his health insurance.
For instance, a sprained ankle can cost a person $220, while charges for a broken arm average nearly $7700.
If you remain uninsured, you may be liable for a penalty of up to 1 percent of your income.
If you wanted to keep roughly the same premium, you often had to double your out-of-pocket costs.
Premiums were often 25 percent of a person’s income.
Just over half of uninsured people said they’d started to pay, compared with nearly 9 in 10 of those signing up on exchanges who said they were simply switching from one health plan to another.
Topic 1 | Topic 2 | Topic 3 | Topic 4 |
---|---|---|---|
uber | wwe | pfizer | rbi |
drivers | match | astrazeneca | inflation |
taxi | ladder | uk | banks |
london | cena | offer | india |
black | him | read | rajan |
app | vs | takeover | cent |
cab | rollins | bid | monetary |
protest | orton | drugs | liquidity |
driver | reigns | inc | january |
cabs | title | bc | repo |
4 Topics | Average Value | Standard Deviation | Max Value |
---|---|---|---|
Criterion 1 | 73.55% | 0.10 | 80.00% |
Criterion 2 | 94.12% | 0.09 | 99.18% |
82 Topics | Average Value | Standard Deviation | Max Value |
---|---|---|---|
Criterion 1 | 20.28% | 0.01 | 22.31% |
Criterion 2 | 55.40% | 0.01 | 57.45% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kozbagarov, O.; Mussabayev, R.; Mladenovic, N. A New Sentence-Based Interpretative Topic Modeling and Automatic Topic Labeling. Symmetry 2021, 13, 837. https://doi.org/10.3390/sym13050837