Explicit and Implicit Knowledge in Large-Scale Linguistic Data and Digital Footprints from Social Networks

Pilgun, Maria

doi:10.3390/bdcc9040075


by Maria Pilgun ¹,²

¹ Department of General and Comparative-Historical Linguistics, Lomonosov Moscow State University, 119991 Moscow, Russia

² Research Institute of Prospective Directions and Technologies, Russian State Social University, 129226 Moscow, Russia

Big Data Cogn. Comput. 2025, 9(4), 75; https://doi.org/10.3390/bdcc9040075

Submission received: 6 February 2025 / Revised: 17 March 2025 / Accepted: 21 March 2025 / Published: 25 March 2025

(This article belongs to the Special Issue Research Progress in Artificial Intelligence and Social Network Analysis)


Abstract

This study explores explicit and implicit knowledge in large-scale linguistic data and digital footprints from social networks. This research aims to develop and test algorithms for analyzing both explicit and implicit information in user-generated content and digital interactions. A dataset of social media discussions on avian influenza in Moscow (RF) was collected and analyzed (tokens: 1,316,387; engagement: 108,430; audience: 39,454,014), with data collection conducted from 1 March 2023, 00:00 to 31 May 2023, 23:59. This study employs Brand Analytics, TextAnalyst 2.32, ChatGPT o1, o1-mini, AutoMap, and Tableau as analytical tools. The findings highlight the advantages and limitations of explicit and implicit information analysis for social media data interpretation. Explicit knowledge analysis is more predictable and suitable for tasks requiring quantitative assessments or classification of explicit data, while implicit knowledge analysis complements it by enabling a deeper understanding of subtle emotional and contextual nuances, particularly relevant for public opinion research, social well-being assessment, and predictive analytics. While explicit knowledge analysis provides structured insights, it may overlook hidden biases, whereas implicit knowledge analysis reveals underlying issues but requires complex interpretation. The research results emphasize the importance of integrating various scientific paradigms and artificial intelligence technologies, particularly large language models (LLMs), in the analysis of social networks.

Keywords:

social network; explicit knowledge; implicit knowledge; large-scale linguistic data; LLMs; ChatGPT (o1, o1-mini); TextAnalyst 2.32

1. Introduction

Issues related to implicit and explicit knowledge are most often raised in the analysis of learning mechanisms in second language acquisition [1]. Implicit knowledge refers to unconscious information that lies beyond awareness, is not subject to verbalization, and is generally identified through behavioral analysis. Explicit knowledge, on the other hand, is defined as knowledge within the realm of awareness, which can often, but not always, be verbalized (see [2,3,4,5]).

The scientific literature discusses methodological problems related to measuring implicit knowledge [6], and a battery of six grammatical tests has been developed to distinguish between automated implicit and explicit knowledge [7]. Additionally, one possible transformation of implicit information into explicit knowledge has been analyzed and presented [8].

The efficiency of utilizing large datasets has been widely debated [9,10]. The development of neural network technologies and large language models (LLMs) has introduced new solutions for analyzing large-scale linguistic data.

1.1. Neural Network Technologies and Hidden Information Extraction

Neural network tools allow for the application of various models to detect hidden information in text and speech. Researchers argue that the best results are achieved with deep learning models based on bidirectional LSTMs [9]. Hidden information in this context refers to instances where individuals possess knowledge on a subject but conceal it. This differs from deception as defined by S. Hu, where someone fabricates knowledge they do not have. S. Hu created a dataset based on wine tasters, whose reviews were encoded using features such as n-grams, LIWC, and GloVe embeddings [11]. The LSTM model employing these features achieved an F-score of 71.51 in detecting hidden information, surpassing human performance at 56.28.

1.2. The Impact of Large Language Models (LLMs)

One of the most transformative developments in recent years has been the emergence of large language models (LLMs). Based on deep learning, these models use neural network architectures capable of processing vast amounts of text and generating new content.

Modern LLMs are trained on enormous datasets and progressively develop capabilities beyond text generation, including semantic analysis, metaphor processing, conceptual understanding, and complex linguistic structures. Language modeling has evolved from early statistical and n-gram models [12] to neural network-based pre-trained language models, culminating in today’s advanced LLMs.

A major milestone in LLM development was the introduction of the Transformer architecture [13], which replaced recurrence with a self-attention mechanism, enabling parallel processing and efficient handling of long-term dependencies. This innovation led to the creation of GPT (Generative Pre-trained Transformer, OpenAI) and BERT (Bidirectional Encoder Representations from Transformers, Google) [14], fundamentally advancing natural language processing (NLP).
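The self-attention mechanism mentioned above can be illustrated with a minimal sketch of scaled dot-product attention. This is a simplified illustration rather than the architecture of any particular model: the toy vectors are invented, and the learned query, key, and value projections of a real Transformer are omitted (the same vectors serve all three roles).

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy "token" vectors (assumed for illustration only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Because every position is computed independently from the same inputs, the loop over queries could run in parallel, which is the property that distinguishes this mechanism from recurrence.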

Today’s LLMs represent a new “revolutionary entity”. Thanks to deep learning and big data, models such as GPT-4o, GPT-4o mini, and GPT-4 effectively perform tasks such as text generation, translation, summarization, question answering, and sentiment analysis [15]. According to a global Nature survey on postdoctoral research, nearly one-third of participants use AI chatbots for refining code, analyzing literature, and other academic tasks [16].

1.3. The Use of New-Type Tools in Scientific Research

DeepMind and Google have been instrumental in expanding AI applications in scientific research, shaping the U.S. research agenda. The report “A New Golden Age of Discovery: Seizing the AI for Science Opportunity” examines how AI is revolutionizing research across disciplines, including biology, physics, chemistry, and meteorology. It highlights five key areas where AI overcomes scalability and complexity barriers, with a particular emphasis on AI’s role in processing scientific data, modeling complex systems, and accelerating experimentation.

A crucial aspect of this transformation is AI’s ability to process textual data. The report discusses how LLMs facilitate literature reviews, hypothesis generation, and the analysis of scientific publications. It also explores examples of AI-driven automatic data extraction from thousands of academic papers. The authors emphasize how language models are reshaping scientific communication, synthesizing complex data, and making it more accessible to specialists and the general public. This includes interpreting scientific findings, generating interactive articles, and translating scientific data for diverse audiences. Moreover, AI-driven linguistic tools are employed for extracting and annotating data from unstructured sources, such as archives and publications, helping to organize vast information repositories [17].

1.4. Promising Research Directions Using New-Type Tools

Scientists continue to integrate new technological solutions into their research. For example, researchers at the University of Texas have developed a GPT-based decoder that translates thoughts into text using non-invasive functional MRI (fMRI) scanning. Although fMRI can pinpoint brain activity with high spatial resolution, its significant time delay prevents real-time tracking of brain activity. Nonetheless, the LLMs underlying ChatGPT o1 can numerically represent the semantic meaning of speech, enabling researchers to identify neural activity patterns corresponding to specific linguistic constructs rather than decoding word-by-word sequences [18].

Manipulations that allow the extraction of harmful information from LLMs have also been analyzed [19]. Additionally, researchers from Google DeepMind and Stanford have presented an analysis of improved fact verification in LLMs, demonstrating their superhuman capabilities at a fraction of the cost of human fact-checkers [20].

Meanwhile, the study of explicit and implicit information in large volumes of linguistic data and digital traces from social media still requires thorough development.

2. Materials and Methods

This research contributes to the ongoing exploration of implicit and explicit knowledge in large-scale linguistic data and digital footprints, providing new perspectives on AI-driven linguistic analysis and its role in social discourse interpretation.

Research Objective

The goal of this study is to develop and test algorithms for analyzing and interpreting implicit and explicit information in social media data, including user-generated content and digital footprints.

Hypothesis

When analyzing user-generated content in social media, the analysis of explicit information alone is insufficient, as actors in virtual communication often conceal their true evaluations, intentions, and motivations. They choose communication behavior patterns in accordance with specific communicative goals. Deep-seated motives and opinions of actors can be identified through the analysis of implicit information. A combined approach, which integrates the analysis of both explicit and implicit information, allows for a more accurate identification of actors’ true evaluations, a deeper understanding of subtle emotional and contextual nuances, and the specific ways in which actors perform their communicative roles. This is particularly crucial for studying public opinion and social tension.

Research Questions

What algorithmic approaches can be developed and empirically tested for the automated analysis of explicit information contained in user-generated content and digital traces in social media?
What algorithms can be developed for detecting implicit information in user-generated content and digital traces in social media, and how can the validation of results be conducted?
What are the key advantages and limitations of different algorithms in analyzing explicit and implicit information in user-generated content on social media?

2.1. Materials

The empirical material used in this study consists of user-generated content related to the spread of avian influenza in Moscow.

The dataset was compiled using the Brand Analytics system for monitoring and analyzing social media and mass media (https://brandanalytics.ru/) (accessed on 10 June 2023).

Data collection was conducted from 1 March 2023, 00:00 to 31 May 2023, 23:59.

The quantitative characteristics of the dataset are presented in Table 1.

The dynamics of the number of mentions, engagement, and audience are presented in Figure 1, Figure 2 and Figure 3.

2.2. Methods

For the analysis of explicit and implicit information, two corresponding algorithms were developed:

2.2.1. Algorithm for Explicit Knowledge Analysis

To analyze explicit information contained in user-generated content and digital traces in social media, an algorithmic approach was developed, as presented below.

After data collection and cleaning, the following procedures were carried out: identification and classification of actor types, followed by an examination of audience characteristics and engagement levels. Subsequently, the thematic structure was extracted, and key content topics were identified. Sentiment analysis was then used to cluster the database by sentiment, after which a semantic network was constructed for each cluster, and the core of each semantic network was identified (Figure 4).
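The later stages of this pipeline (clustering by sentiment, building a semantic network per cluster, and identifying its core) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure implemented in TextAnalyst 2.32 or ChatGPT: the posts and their sentiment labels are invented, the network is a plain word co-occurrence graph, and the "core" is approximated by the nodes with the highest weighted degree.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical, hand-labelled posts: (text, sentiment).
posts = [
    ("quarantine imposed in moscow districts", "negative"),
    ("quarantine restrictions hurt poultry trade", "negative"),
    ("quarantine is a justified response", "positive"),
    ("experts support quarantine measures", "positive"),
]

def cluster_by_sentiment(labelled_posts):
    """Group tokenized posts under their sentiment label."""
    clusters = defaultdict(list)
    for text, sentiment in labelled_posts:
        clusters[sentiment].append(text.split())
    return clusters

def semantic_network(token_lists):
    """Count how often two distinct words co-occur in the same post."""
    edges = Counter()
    for tokens in token_lists:
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[(a, b)] += 1
    return edges

def network_core(edges, top_n=3):
    """Rank nodes by weighted degree and keep the top_n as the core."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return [word for word, _ in degree.most_common(top_n)]

clusters = cluster_by_sentiment(posts)
cores = {label: network_core(semantic_network(docs))
         for label, docs in clusters.items()}
```

In this toy data the word "quarantine" heads the core of both clusters, which mirrors the general idea that the same topic can anchor discussions with opposite sentiment.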

2.2.2. Algorithm for Implicit Knowledge Analysis

When developing an algorithm utilizing neural network tools for detecting implicit information in user-generated content and digital traces in social media, special attention was given to the analysis of digital aggression and associative networks. In the analysis of implicit knowledge, content was clustered based on the presence or absence of aggression. Unlike sentiment, aggression provides a more accurate measure of emotional tension and the hidden intentions of actors. Special attention was also given to analyzing the associative network and lexical associations, which reveal the hidden judgments of actors and enable the analysis of implicit knowledge and actors’ perceptions (Figure 5).
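Clustering content by the presence or absence of aggression can be sketched in a few lines. The study relies on neural network tools for this step; the example below swaps in a simple keyword lexicon purely for illustration, and the marker list, the sample posts, and the function names are all hypothetical.

```python
# Hypothetical aggression lexicon; a real study would use a trained classifier.
AGGRESSION_MARKERS = {"damn", "idiots", "hate"}

def is_aggressive(text, markers=AGGRESSION_MARKERS):
    """Flag a post if any token, stripped of punctuation, is in the lexicon."""
    tokens = {tok.strip('.,!?"').lower() for tok in text.split()}
    return bool(tokens & markers)

def cluster_by_aggression(posts):
    """Split posts into an aggressive and a non-aggressive cluster."""
    clusters = {"aggressive": [], "non_aggressive": []}
    for post in posts:
        key = "aggressive" if is_aggressive(post) else "non_aggressive"
        clusters[key].append(post)
    return clusters

def aggression_share(clusters):
    """Fraction of all posts that fall into the aggressive cluster."""
    total = sum(len(group) for group in clusters.values())
    return len(clusters["aggressive"]) / total if total else 0.0

sample_posts = [
    "Damn quarantine again!",
    "Quarantine declared in several districts.",
    "These idiots closed the fair.",
    "Experts advise caution.",
]
aggression_clusters = cluster_by_aggression(sample_posts)
share = aggression_share(aggression_clusters)
```

The resulting share gives a single indicator of the aggression level in a collection, and the aggressive cluster can then be passed to associative network analysis.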

The proposed approach was developed during the implementation of practical urban planning projects starting in 2019. Specific stages of the research design formulation are presented, in particular, in [21,22].

2.2.3. Tools and Methodology

GenAI tools were used for data collection, analysis, and interpretation in this study.

Data Collection

Data were collected using the Brand Analytics system (https://brandanalytics.ru/) (accessed on 10 June 2023).

Data Analysis and Interpretation

For content analysis, the AutoMap text analysis tool, developed by CASOS at Carnegie Mellon University, was used. According to its description:

“AutoMap is part of a text mining suite that includes a series of pre-processors for cleaning raw texts so that they can be processed, and a set of post-processors that employ semantic inferencing to improve coding and deduce missing information” (http://casos.cs.cmu.edu/projects/automap/, accessed on 10 June 2023).

Thematic Structure Extraction, Summarization, Sentiment Analysis, Aggression Analysis, Clustering, and Semantic and Associative Network Analysis

These tasks were conducted using TextAnalyst 2.32 and ChatGPT (o1 and o1-mini).

TextAnalyst 2.32

The neural network technology TextAnalyst 2.32 enables the analysis of social media content. Specifically, it allows for content analysis with automatic generation of a thematic tree with hyperlinks, the formation of a semantic network with hyperlinks, semantic search that considers hidden semantic relationships, automatic text summarization with semantic emphasis extraction, information clustering, analysis of text distribution across thematic classes, automatic text indexing that converts it into hypertext, ranking all types of information based on the semantic content of the text, and automated creation of a full-text knowledge base with a hypertext structure and associative access to information (https://www.analyst.ru/).

ChatGPT (o1 and o1-mini)

O1 is an AI-powered tool based on ChatGPT, providing access to an enhanced GPT-4 version. This version is optimized for handling large datasets, enabling efficient analysis of complex linguistic structures and the extraction of both explicit and implicit knowledge. Due to its increased computational capacity and improved data processing algorithms, o1 is particularly suitable for detailed analysis of large-scale text corpora, including social media data, and demonstrates high accuracy in NLP tasks.

Key Features of O1:

Expanded context memory, allowing for the processing of longer and more complex texts.

Enhanced ability to detect hidden linguistic relationships and patterns in digital footprints.

High performance for tasks involving the analysis of large datasets.

O1-Mini is a lightweight counterpart of o1, providing access to the same core GPT-4 language model but with limited computational resources. This version is more compact and cost-efficient, making it suitable for less resource-intensive tasks. O1-Mini is particularly useful for analyzing individual texts, small data samples, or performing rapid queries within research projects where processing speed with acceptable accuracy is a priority.

Key Features of O1-Mini:

Reduced context memory, optimized for more compact tasks.

Fast query processing while maintaining basic analytical quality.

Ideal for working with fragments of digital footprints and social media data.

Both o1 and o1-mini can be used as tools for analyzing explicit and implicit knowledge, where o1 is recommended for in-depth studies on large datasets, and o1-mini is suitable for real-time analysis and processing of smaller datasets (https://chatgpt.com).

Data Visualization

For data visualization, the Tableau visual analytics platform was used. Tableau is business intelligence (BI) software that allows users to connect to spreadsheets or data files and create interactive data visualizations (https://public.tableau.com).

The choice of methodology for data collection, analysis, and interpretation was determined by the author’s research experience, which demonstrated that this particular toolset allows for the most accurate achievement of the study’s objectives.

3. Results

3.1. Explicit Knowledge Analysis

The testing of the developed algorithmic approach for the automated analysis of explicit information contained in user-generated content and digital traces in social media enabled the analysis of content and yielded the following results.

When seeking information and discussing issues related to the spread of avian influenza in Moscow, actors preferred the channels Mash, Ran’she vseh. Nu pochti, and RIA Novosti (Appendix A).

An analysis of message types based on the audience shows a predominance of original posts over reposts (Figure 6).

An analysis of the sentiment of digital footprints revealed that negative reactions dominate (Figure 7).

Analysis of Thematic Structure in User-Generated Content and Semantic Network Core Analysis

The analysis of the thematic structure of user-generated content and the core of the semantic network (Appendix B) allowed for the identification of semantic focal points that were most important to actors discussing the potential spread of avian influenza in Moscow.

Particular emphasis was placed on semantic accents related to the quarantine measures introduced by the Mayor of Moscow and the ban on public events involving animals, which were perceived negatively by the actors—drawing parallels to the quarantine measures during the COVID-19 pandemic.

Top Stories from the Negative Cluster That Gained the Most Audience Reach

Audience: 2,702,680

“Sobyanin imposed quarantine in several districts of Moscow due to avian influenza”.

Areas designated as high-risk zones include Brateyevo, Maryino, Lyublino, and Pechatniki.

During the quarantine period, restrictions include:

A ban on the import and export of birds and hatching eggs,

A ban on the harvesting and transportation of bird feed,

A ban on agricultural fairs.

Quarantine measures for avian influenza are not uncommon—similar restrictions were imposed in Moscow in January 2022, when an outbreak was detected in Gorky Park. (Telegram).

Audience: 1,801,397

“Quarantine in Moscow declared due to seagulls found at Borisov Ponds”.

The Federal Animal Health Protection Center conducted an investigation into dead birds and confirmed cases of highly pathogenic avian influenza.

A special anti-epizootic commission determined that the areas where the dead seagulls were found would be classified as potential virus transmission zones, leading to the imposition of quarantine.

Additionally, inspections were conducted at major poultry farms in Moscow and its surrounding areas, with warnings issued to facilities found in violation of safety measures.

According to estimates, Moscow has 10,800 birds in 340 private households and 16 farming enterprises. (Telegram)

Audience: 1,801,397

“Moscow imposes quarantine again”.

Sergei Sobyanin announced a ban on public events involving animals in several districts of the city due to the spread of avian influenza.

The following districts are affected: Brateyevo, Kapotnya, Maryino, Lyublino, Pechatniki, Moskvorechye-Saburovo, Tsaritsyno, Biryulyovo Vostochnoye, Northern and Southern Orekhovo-Borisovo, and Zyablikovo. (Telegram)

Key Assessments and Opinions Expressed by Users on Avian Influenza

Based on the explicit knowledge analysis algorithm, the following thematic blocks were identified:

1. Assessment of City Administration’s Actions

Positive Opinions:

The quarantine measures in Moscow are seen as a rapid and justified response to prevent the spread of avian influenza.

Mayor Sergei Sobyanin’s role in ensuring public safety and crisis management is acknowledged.

Restrictions, such as limiting poultry transport, feed, and fairs, are viewed as necessary steps to minimize risks.

Critical Remarks:

Some users perceive the city administration’s actions as excessive or as a distraction from other issues (e.g., economic concerns).

There are speculations that the quarantine measures serve political or economic purposes rather than being solely for public health protection.

2. Perceived Threat of Avian Influenza

Risk Perception:

Many users emphasize that avian influenza poses a serious threat, especially due to the virus’s potential to cross species barriers, increasing the risk of human infection.

Concerns are raised over possible virus mutations, which could make it more dangerous for humans.

Skepticism and Criticism:

Some users doubt the real threat of avian influenza to humans, considering that transmission from birds to humans is rare.

The topic of “historical panic” is brought up, suggesting that epidemics are often exaggerated by the media.

3. Media Coverage of the Issue

Positive Views:

The importance of media coverage in informing the public about precautionary measures is recognized.

Expert statements on avian influenza risks and preventive measures are generally well-received.

Critical Remarks:

Some users believe that avian influenza is exaggerated in media reports to create sensational headlines.

Others argue that the media presents a one-sided narrative, focusing on the threats without offering solutions.

4. Impact of Quarantine on Daily Life

Concerns Expressed:

Users worry that quarantine restrictions in Moscow’s southern districts could disrupt daily life, particularly affecting access to poultry products.

Economic losses for farmers and small businesses reliant on poultry trade are a major concern.

Supportive Views:

Many acknowledge that such measures are necessary to prevent further spread of the virus.

Trade restrictions on poultry are seen as a valid precaution to protect public health.

5. Expert Opinions

Scientific Approach Endorsement:

Comments from immunologists and epidemiologists stress the importance of strict quarantine measures to prevent virus transmission.

The need for rapid intervention and close monitoring of infection hotspots is emphasized.

Discussion on Virus Mutation Risks:

Experts warn about the potential emergence of more dangerous virus strains, which strengthens public support for strict containment measures.

6. Social Well-Being

Emotions and Anxiety:

Many users share their personal experiences and concerns about quarantine restrictions, noting that lockdowns have become a recurring part of life in recent years.

Some feel that information on avian influenza adds to overall public stress and fatigue from epidemics.

Calls for Solidarity:

Certain users highlight the importance of collective action in fighting infections.

There are calls for vigilance and adherence to precautionary measures to prevent new outbreaks.

Key Findings

The open discussion surrounding avian influenza revolves around several key themes:

Evaluation of the city’s response (both supportive and critical perspectives).

Perceived threat level (ranging from serious concern to skepticism).

Impact of quarantine measures on daily life (disruptions and economic implications).

Media coverage (informative value vs. sensationalism).

Both positive and critical opinions coexist, reflecting a balance between trust in precautionary measures and skepticism about their motives and scale.

3.2. Analysis of Implicit Knowledge

Empirical testing of the developed algorithm for detecting implicit information in user-generated content and digital traces in social media enabled content analysis and yielded the following results.

In the analysis of implicit knowledge, aggression analysis and associative network analysis play a particularly significant role.

Aggression analysis in digital footprints is one of the most important indicators of actors’ attitudes toward an event. In heightened emotional contexts, true opinions and evaluations tend to emerge as users lose control over their expressions.

A low level of aggression in the digital footprints of actors indicates the absence of potential internal conflict in the development of the analyzed situation (Figure 8).

Stories That Evoked the Strongest Negative Emotional Reactions Among Muscovites

The most emotionally negative reactions among Muscovites were also linked to the quarantine measures and the ban on public events involving animals.

Top Aggression Cluster Stories That Gained the Widest Audience Reach

Audience: 228,049

“Sobyanin declared quarantine due to avian influenza in several districts of the capital”. (Telegram)

Audience: 15,990

“Sobyanin declared quarantine in Moscow districts due to avian influenza”.

The Mayor of Moscow, Sergei Sobyanin, declared quarantine in several districts of the capital due to the avian influenza outbreak. (Telegram)

Audience: 11,447

“Moscow Mayor Sergei Sobyanin declared quarantine in several districts of the capital due to avian influenza”.

The following districts in Moscow were designated as potential outbreak zones: Brateyevo, Kapotnya, Maryino, Lyublino, Pechatniki, Moskvorechye-Saburovo, Tsaritsyno, Biryulyovo Vostochnoye, Northern and Southern Orekhovo-Borisovo, and Zyablikovo.

Comment: “F* that, I’m not going to you”. (Telegram)

Implicit Knowledge Extraction Through Associative Network and Lexical Analysis

The formation of an associative network, analysis of lexical associations, and content analysis of user reactions provided insights into implicit knowledge embedded in user-generated content.

An associative network was formed and studied for the stimulus Ptichij gripp (Avian Influenza) (10/5401) (Appendix C). Associative network analysis for the stimulus Sobyanin (Moscow Mayor) (10/5466) allowed for an examination of hidden assessments and opinions among Muscovites. The analysis revealed associations between the avian influenza outbreak of May 2023 and the COVID-19 pandemic.
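The idea of an associative network around a stimulus word can be sketched as a count of the words that occur near the stimulus. This is only a rough approximation under stated assumptions: the tokenized posts are invented, the window size and the `associations` helper are hypothetical, and the tools used in the study may derive associations quite differently.

```python
from collections import Counter

def associations(token_lists, stimulus, window=2, top_n=10):
    """Count words appearing within `window` tokens of each stimulus occurrence."""
    counts = Counter()
    for tokens in token_lists:
        for i, tok in enumerate(tokens):
            if tok != stimulus:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts.most_common(top_n)

# Hypothetical tokenized posts.
docs = [
    "sobyanin declared quarantine again".split(),
    "quarantine like covid again".split(),
    "covid quarantine fatigue".split(),
]
top_associations = associations(docs, "quarantine")
```

In this toy corpus "covid" is the strongest associate of "quarantine", the kind of link that an analyst would then interpret as a possible implicit evaluation.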

Within the aggression cluster, which reflects the strongest negative reactions, the quarantine measures and the ban on public events involving animals were not perceived as a necessary response but rather as measures taken for political or economic advantage (Appendix D).

Implicit Evaluations and Knowledge in the Context of the Avian Influenza Discussion

1. Concerns Over the Escalation of the Issue

Quarantine imposed in 11 Moscow districts, including densely populated areas, despite official claims that poultry products remain safe.

The mention of the “species barrier”—which the virus has allegedly crossed—intensifies fears of a pandemic, despite the rarity of avian-to-human transmission.

Fear of an epidemic outbreak expanding beyond control.

Distrust in official information and a tendency to seek alternative explanations in social media.

2. Distrust in the Media and Perceived Media Manipulation

Suspicion that the avian influenza narrative is being used to distract from other events.

The high frequency of media mentions is perceived as an artificial escalation of the issue.

3. Uncertainty About Consequences for the Public

Concerns about quarantine measures impacting the local economy, access to poultry products, and daily life.

The potential increase in social tension due to uncertainty and economic losses for businesses and households.

Key Findings

The analysis of implicit knowledge, conducted on the same database as the analysis of explicit information, allowed for a more comprehensive examination, uncovering latent emotions and actors’ intentions. The specificity of implicit knowledge analysis made it possible to identify underlying nuances, opinions, and evaluations that actors either cannot or do not wish to express openly.

Notably, in open discussions about avian influenza, actors primarily focused on evaluating the actions of the city administration, the degree of actual threat, the potential impact on daily life, and media coverage of the issue. However, the analysis of implicit information revealed that behind these discussions of specific problems lay a fear of the potential scale of the epidemic and the perceived risk of another pandemic outbreak.

Moreover, the implicit fear of a new epidemic appears to reinforce distrust toward the media and official statements.

The validation of the research results was conducted using two models: TextAnalyst 2.32 and ChatGPT (o1 and o1-mini).

4. Discussion

The challenges encountered in developing algorithmic approaches for analyzing explicit and implicit information contained in user-generated content and digital traces in social media have highlighted the importance of adapting new neural network tools and LLMs for studying large volumes of linguistic data.

Neural network technologies and Large Language Models (LLMs) have introduced new possibilities for linguistic data analysis, particularly for examining explicit and implicit knowledge in large-scale linguistic datasets and digital footprints from social media.

However, the use of this new technology presents significant challenges, requiring careful examination and professional discussions within linguistic and AI research communities.

4.1. Challenges in the Application of LLMs

One of the key issues is that LLMs operate on principles fundamentally different from traditional linguistic approaches.

For example, in their study “Natural Language Processing RELIES on Linguistics”, Juri Opitz, Shira Wein, and Nathan Schneider address an important question: have LLMs transformed NLP so thoroughly that linguistics is no longer essential to the field, as many now believe? The authors argue that linguistics remains a crucial component of NLP and identify six key facets (forming the acronym RELIES) where it plays a fundamental role; Interpretability and Explanation are presented together below:

Resources—Linguistic expertise is essential for creating, annotating, and curating language data, especially for low-resource and endangered languages.

Evaluation—Reliable model evaluation methods depend on linguistic categories and psycholinguistic diagnostics, ensuring more accurate interpretations.

Low-resources—Linguistic approaches help address challenges in data-scarce environments, where language resources or computational power are limited.

Interpretability and Explanation—Linguistics provides a metalanguage and analytical tools for explaining the behavior and decision-making of language models.

Study of Language—Linguistics remains a driving force behind NLP research, influencing corpus linguistics, lexicography, and digital humanities.

This discussion highlights that, despite the progress of LLMs, linguistics is still essential for developing robust and interpretable language technologies. This synthesis of knowledge opens new interdisciplinary research perspectives and practical applications in NLP [23].

The relationship between contemporary linguistic methods and LLM advancements, as well as how deep learning technologies and language models transform scientific approaches to language studies, can be examined from several perspectives.

4.2. Methods of Modern Linguistics

Modern linguistics offers a wide range of methods for analyzing and describing language, from syntactic models to pragmatic approaches to text interpretation. It is crucial to identify methods that apply both to classical linguistics and machine learning-based NLP.

4.2.1. Syntactic Analysis

One of the core linguistic methods is syntactic analysis, which identifies sentence structures and grammatical relationships between words. Linguistic theories describe grammatical structures in detail, explaining how words interact within a sentence.

LLMs such as GPT generate syntactically coherent text, but their approach differs from classical linguistic parsing models. Rather than applying explicit parsing rules, LLMs rely on probabilistic methods, predicting the most likely continuations, and thereby the most likely syntactic structures, on the basis of patterns learned from massive training datasets.
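The contrast between rule-based parsing and probabilistic prediction can be illustrated with a deliberately minimal sketch. The bigram model below is far simpler than any transformer-based LLM and is not the method used in this study; it only shows the general idea of estimating the most likely next word from observed frequencies instead of grammar rules. The toy corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Sentence-boundary markers let the model learn starts and ends.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Relative-frequency estimate of P(next word | previous word)."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}  # the previous word was never seen in training
    return {word: c / total for word, c in counts[prev].items()}

# Hypothetical toy corpus, not taken from the study's dataset.
corpus = [
    "the birds are sick",
    "the birds are healthy",
    "the farm is closed",
]
model = train_bigram(corpus)
probs = next_word_probs(model, "the")  # "birds" is twice as likely as "farm"
```

No grammatical category is ever declared here: the preference for "birds" after "the" emerges purely from counts, which is the sense in which such models are probabilistic.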

4.2.2. Semantic Analysis

The treatment of semantics also differs significantly between traditional linguistic approaches and LLMs.

LLMs learn to interpret word meanings in context, enabling them to perform tasks such as question answering and machine translation.

Distributional semantics methods, such as word embeddings, have become fundamental for training LLMs. These models use contextual word representations, allowing them to infer word meanings based on surrounding words and discourse structure.
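As a rough illustration of the distributional idea, the sketch below builds count-based co-occurrence vectors and compares them with cosine similarity. It is an assumption-laden toy: real embeddings (for example, GloVe or the contextual representations inside LLMs) are learned, dense, and far richer. The corpus, window size, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words appearing near it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy corpus, not taken from the study's dataset.
corpus = [
    "doctors treat sick birds",
    "doctors treat sick poultry",
    "officials closed the market",
]
vecs = cooccurrence_vectors(corpus)
sim_close = cosine(vecs["birds"], vecs["poultry"])  # shared contexts
sim_far = cosine(vecs["birds"], vecs["market"])     # no shared contexts
```

Words that occur in the same contexts ("birds" and "poultry") receive nearly identical vectors, while words with disjoint contexts receive a similarity of zero.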

4.2.3. Pragmatic Analysis

Pragmatics in linguistics studies how language is used in real communication scenarios, including context, speaker intentions, and the impact of utterances.

LLMs face challenges in interpreting not only syntactic and semantic aspects of language but also pragmatic nuances. For example, models must understand user intent when generating chatbot responses or creating texts based on a query.

Within LLMs, pragmatic aspects are particularly difficult to automate because these models are trained on static textual data, where context and communicative goals may be ambiguous or incomplete. Nonetheless, modern LLMs are evolving to incorporate elements of pragmatic analysis, improving their ability to generate contextually appropriate responses.

4.3. Linguistic Principles and Their Influence on LLMs

The Principle of Compositionality

The principle of compositionality states that the meaning of an entire sentence is determined by the meanings of its individual parts and how they are combined.

This principle plays a crucial role in LLM training, as models must consider how words and phrases interact to form meaningful text.

LLMs successfully apply this principle through multi-layered transformers, which enable models to account for contextual relationships at different levels and construct complex connections between words and phrases.

However, despite their success in generating coherent text, LLMs struggle with cases in which compositionality does not hold straightforwardly, such as metaphors or idiomatic expressions.
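Why idioms are problematic for a purely compositional reading can be shown with a small sketch. The polarity values and the idiom table below are hypothetical and serve only as an illustration; they are not drawn from the tools used in this study.

```python
# Hypothetical word-level polarity lexicon (illustrative values only).
WORD_POLARITY = {"break": -1, "a": 0, "leg": 0, "good": 1, "luck": 1}

# Hypothetical idiom table: the meaning is stored for the whole phrase.
IDIOM_POLARITY = {("break", "a", "leg"): 1}

def compositional_polarity(tokens):
    """Meaning of the whole as the sum of the meanings of the parts."""
    return sum(WORD_POLARITY.get(token, 0) for token in tokens)

def polarity_with_idioms(tokens):
    """Check for a stored idiom first, then fall back to composition."""
    key = tuple(tokens)
    if key in IDIOM_POLARITY:
        return IDIOM_POLARITY[key]
    return compositional_polarity(key)

phrase = "break a leg".split()
literal = compositional_polarity(phrase)   # negative: driven by "break"
idiomatic = polarity_with_idioms(phrase)   # positive: a wish of good luck
```

The purely compositional reading assigns the phrase a negative value, whereas its conventional meaning is positive; the meaning must be attached to the expression as a whole.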

Thus, methods and principles of modern linguistics remain critical for understanding how large language models such as GPT function.

4.4. Integration of LLMs into Linguistic Research

Rather than treating LLMs as a replacement for linguistic research, it seems more reasonable to integrate them into contemporary linguistic studies as part of the analytical toolkit for linguistic data analysis. The present study serves as an example of such an approach.

A more serious challenge, however, lies in LLM hallucinations, errors, and the generation of false content [20,24,25]. To address this problem, researchers propose using multiple parallel models, which can improve the validity and verification of results.

In this study, we employed multiple models, including:

TextAnalyst 2.32
ChatGPT (o1 and o1-mini)

By comparing the results of these models, we were able to assess the reliability of the findings.
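The text does not specify how the outputs of the models were compared. One common option, shown here purely as an assumption, is to compute chance-corrected agreement (Cohen's kappa) between the labels that two models assign to the same messages. The label sequences below are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Share of items on which both models agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, given each model's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both models always emit the same single label
    return (observed - expected) / (1 - expected)

# Hypothetical sentiment labels from two models for six messages.
model_a = ["neg", "neg", "pos", "neu", "neg", "pos"]
model_b = ["neg", "neu", "pos", "neu", "neg", "pos"]
kappa = cohen_kappa(model_a, model_b)
```

Values close to 1 indicate that the models agree well beyond chance, while values near 0 suggest that apparent agreement could be accidental and that the finding needs expert review.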

The development of traditional linguistic methods within the context of modern LLMs highlights the need for integrating linguistic theories with technological innovations.

Thus, large language models (LLMs), such as GPT, offer new tools and opportunities for analyzing large-scale data in the digital environment. However, their application requires careful consideration of the challenges and limitations they introduce to social network analysis.

5. Conclusions

Testing the research hypothesis allowed us to address the key research questions and identify the unique characteristics of each approach. The hypothesis was confirmed: the combined approach, which integrates the analysis of both explicit and implicit information, enabled a more accurate identification of subjects’ true evaluations, a deeper understanding of subtle emotional and contextual nuances, and the formation of a comprehensive picture of actors’ opinions and intentions. The results were validated by applying two models: TextAnalyst 2.32 and ChatGPT (o1 and o1-mini).

As a result of this study, algorithms for analyzing explicit and implicit information in user-generated data and their digital footprints in social media were proposed and tested.

5.1. Explicit Information Analysis

The developed algorithm enabled the identification of openly expressed user opinions and evaluations. The application of thematic analysis and sentiment analysis helped extract key discussion topics in social media, assess the tonality of content, and measure user engagement. Additionally, the formation of a semantic network and the interpretation of its core elements allowed us to identify semantic focal points that were most significant to users in discussions about the spread of avian influenza.

The key advantages of this algorithm include:

A high degree of automation in processing large datasets.

Relatively high accuracy in identifying thematic focal points.

However, its limitations are related to:

Dependence on data quality and inconsistencies in digital interactions, which do not always accurately reflect actors’ real opinions and evaluations.

5.2. Implicit Information Analysis

The second algorithm focused on identifying hidden evaluations and opinions of actors, allowing for a more precise understanding of users’ actual perceptions.

The core aspects of this algorithm included:

Formation of associative networks

Analysis of lexical associations

Detection of users’ emotional intentions
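The internal algorithms of the tools used to build the associative networks are not described in this section. As a simplified stand-in, the sketch below ranks the words that co-occur with a stimulus word across messages; the co-occurrence criterion, the messages, and the function name are all assumptions made for illustration.

```python
from collections import Counter

def associative_network(messages, stimulus, top_n=10):
    """Rank words by the number of messages they share with the stimulus."""
    associates = Counter()
    for message in messages:
        tokens = set(message.lower().split())
        if stimulus in tokens:
            # Each associate is counted at most once per message.
            associates.update(tokens - {stimulus})
    return associates.most_common(top_n)

# Hypothetical messages, not taken from the study's dataset.
messages = [
    "flu outbreak closes market",
    "flu outbreak worries residents",
    "market reopens today",
]
network = associative_network(messages, "flu")
```

In this toy example "outbreak" is the strongest associate of the stimulus because it shares two messages with it, while words from messages that never mention the stimulus do not enter the network.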

Implicit information analysis demonstrated the ability to detect latent emotions and intentions that are not explicitly expressed in statements but are crucial for understanding the underlying motives of actors.

The key advantage of this approach is its ability to interpret hidden aspects of discourse. However, the complexity of model calibration and result interpretation necessitates expert involvement. Implicit information analysis complements explicit analysis by allowing for a deeper understanding of subtle emotional and contextual nuances, which is particularly valuable for studying public opinion, social well-being, societal tensions, and predictive analytics.

This study underscores the importance of integrating traditional and contemporary data analysis methods, thereby establishing a foundation for further exploration of explicit and implicit knowledge in social network analysis.

A promising research direction is the development of multimodal data analysis, as digital media texts incorporate elements of various modalities (written and spoken text, static and dynamic video sequences, mixed and remixed video formats, 3D–7D objects), which are integrated into a single message.

The integration of LLMs with image recognition models and neural network analytical tools can enable a more comprehensive content analysis, particularly in the study of explicit and implicit knowledge.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study.

Acknowledgments

The author expresses gratitude to the developers of the GenAI tools and the intelligent systems used for data collection, analysis, and interpretation in this study: Brand Analytics (https://brandanalytics.ru/) (accessed on 10 June 2023), AutoMap (http://casos.cs.cmu.edu/projects/automap/) (accessed on 10 June 2023), TextAnalyst 2.32 (https://www.analyst.ru/) (accessed on 10 June 2023), ChatGPT (o1 and o1-mini) (https://chatgpt.com) (accessed on 10 June 2023), and Tableau (https://public.tableau.com) (accessed on 10 June 2023).

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

Ranking of actors by audience size.

Figure A1. Rating of content-generating actors according to the size of their audience. Actor names are given in their original spelling.

Appendix B

The core of the semantic network.

Figure A2. The core of the semantic network. The nominations forming the core of the semantic network are given in their original spelling.

Appendix C

Associative network with stimulus Ptichij gripp (Avian Influenza) (10/5401).

Figure A3. Associative network with stimulus Ptichij gripp (Avian Influenza) (10/5401). The nominations forming the associative network are given in their original spelling.

Appendix D

Associative network with stimulus Sobyanin (10/5466).

Figure A4. Associative network with stimulus Sobyanin (10/5466). The nominations forming the associative network are given in their original spelling.

References

1. Hulstijn, J.H. Theoretical and empirical issues in the study of implicit and explicit second-language learning. Stud. Second Lang. Acquis. 2005, 27, 129–140.
2. DeKeyser, R.M. Cognitive–psychological processes in second language learning. In Handbook of Second Language Teaching; Long, M., Doughty, C., Eds.; Oxford Blackwell: Oxford, UK, 2009; pp. 119–138.
3. Dörnyei, Z. The Psychology of Second Language Acquisition; Oxford University Press: New York, NY, USA, 2009.
4. Reber, A.S. Implicit Learning and Tacit Knowledge: An Essay on the Cognitive Unconscious; Clarendon Press: London, UK, 1993.
5. Williams, J.N. Implicit learning in second language acquisition. In The New Handbook of Second Language Acquisition; Ritchie, W., Bhatia, T.K., Eds.; Emerald Group Publishing: Bingley, UK, 2012; pp. 319–344.
6. Suzuki, Y.; DeKeyser, R.M. The interface of explicit and implicit knowledge in a second language: Insights from individual differences in cognitive aptitudes. Lang. Learn. 2017, 67, 747–790.
7. Suzuki, Y. Validity of new measures of implicit knowledge: Distinguishing implicit knowledge from automatized explicit knowledge. Appl. Psycholinguist. 2017, 38, 1229–1261.
8. Dingli, A. Knowledge Annotation: Making Implicit Knowledge Explicit; Springer: Berlin/Heidelberg, Germany, 2011.
9. Halevy, A.; Norvig, P.; Pereira, F. The Unreasonable Effectiveness of Data. IEEE Intell. Syst. 2009, 24, 8–12. Available online: https://gwern.net/doc/ai/scaling/2009-halevy.pdf (accessed on 1 December 2024).
10. Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. arXiv 2017, arXiv:1707.02968v2.
11. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
12. Arisoy, E.; Sainath, T.N.; Kingsbury, B.; Ramabhadran, B. Deep neural network language models. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, Montreal, QC, Canada, 8 June 2012.
13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
14. Hu, S. Detecting concealed information in text and speech. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 402–412.
15. Hadi, M.; Qasem, U.; Tashi, A.; Qureshi, R. A Survey on Large Language Models: Applications, Challenges, Limitations, and Practical Usage. TechRxiv. 2023. Available online: https://arxiv.org/pdf/2303.18223 (accessed on 1 January 2024).
16. Nordling, L. How ChatGPT is transforming the postdoc experience. Nature 2023, 622, 655–657.
17. Griffin, C.; Wallace, D.; Mateos-Garcia, J.; Schieve, H.; Kohli, P. A New Golden Age of Discovery. Seizing the AI for Science Opportunity. 2024. Available online: https://deepmind.google/public-policy/ai-for-science/ (accessed on 1 December 2024).
18. Tang, J.; LeBel, A.; Jain, S.; Huth, A.G. Semantic reconstruction of continuous language from non-invasive brain recordings. Nat. Neurosci. 2023, 26, 858–866.
19. Zeng, Y.; Lin, H.; Zhang, J.; Yang, D.; Jia, R.; Shi, W. How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. arXiv 2024, arXiv:2401.06373v2.
20. Wei, J.; Yang, C.; Song, X.; Lu, Y.; Hu, N.; Huang, J.; Tran, D.; Peng, D.; Liu, R.; Huang, D.; et al. Long-form factuality in large language models. arXiv 2024, arXiv:2403.18802v4.
21. Kharlamov, A.A.; Pilgun, M.A. Cognitive Studies in the Interpretation of Social Media Data: TextAnalyst and ChatGPT. Pattern Recognit. Image Anal. 2024, 34, 597–609.
22. Pilgun, M.; Koreneva, O. Información implícita y explícita en la percepción del covid-19 en los medios de comunicación social en español, alemán y ruso. Palabra Clave 2022, 25, e2513.
23. Opitz, J.; Wein, S.; Schneider, N. Natural Language Processing RELIES on Linguistics. arXiv 2024, arXiv:2405.05966v3.
24. Kim, H.; Sclar, M.; Zhou, X.; Le Bras, R.; Kim, G.; Choi, Y.; Sap, M. Fantom: A benchmark for stress-testing machine theory of mind in interactions. arXiv 2023, arXiv:2310.15421.
25. Kim, S.; Suk, J.; Cho, J.Y.; Longpre, S.; Kim, C.; Yoon, D.; Son, G.; Cho, Y.; Shafayat, S.; Baek, J.; et al. Fine-grained Evaluation of Language Models with Language Models. arXiv 2024, arXiv:2406.05761.

Figure 1. The dynamics of mentions of the investigated issue in the collected database.

Figure 2. The dynamics of actors’ engagement in the discussion of the issue in the collected database.

Figure 3. The dynamics of audience activity in the discussion of the issue in the collected database.

Figure 4. Algorithm for explicit knowledge analysis.

Figure 5. Algorithm for implicit knowledge analysis.

Figure 6. Message type by audience and engagement.

Figure 7. Sentiment of digital footprints.

Figure 8. Aggression in digital footprints.

Table 1. Quantitative characteristics of the dataset.

Parameter	Data
Messages	2061
Authors	1454
Tokens	1,316,387
Engagement	108,430
Audience	39,454,014

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

