Automated Construction and Mining of Text-Based Modern Chinese Character Databases: A Case Study of Fujian

Jian, Xueyan; Yuan, Wen; Yuan, Wu; Gao, Xinqi; Wang, Rong

doi:10.3390/info16040324


by Xueyan Jian ^1,2, Wen Yuan ^1,*, Wu Yuan ^3, Xinqi Gao ^1,2 and Rong Wang ^1,2

^1 State Key Laboratory of Resource and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China

^2 University of Chinese Academy of Sciences, Beijing 100049, China

^3 School of Computer Science, Beijing Institute of Technology, Beijing 100081, China

^* Author to whom correspondence should be addressed.

Information 2025, 16(4), 324; https://doi.org/10.3390/info16040324

Submission received: 7 March 2025 / Revised: 2 April 2025 / Accepted: 16 April 2025 / Published: 18 April 2025


Abstract:

Historical figures are crucial for understanding historical processes and social changes. However, existing databases of historical figures focus primarily on ancient Chinese individuals and are limited by the simplistic organization of textual information, lacking structured processing. Therefore, this study proposes an automatic method for constructing a spatio-temporal database of modern Chinese figures. The character state transition matrix reveals the spatio-temporal evolution of historical figures, while the random walk algorithm identifies their primary migration patterns. Using historical figures from Fujian Province (1840–2009) as a case study, the results demonstrate that this method effectively constructs the spatio-temporal chain of figures, encompassing time, space, and events. The character state transition matrix indicates a fluctuating trend of state change from 1840 to 2009, initially increasing and then decreasing. By applying keyword extraction and the random walk method, this study finds that the state transitions and their causes align with the historical trends. The four-dimensional analytical framework of “character-time-space-event” established in this study holds significant value for the field of digital humanities.

Keywords:

character information; spatio-temporal data; data mining; digital humanities

1. Introduction

People are the subject of historical activity [1]. Historical figures play pivotal roles in shaping the course of history, profoundly impacting societal development through their significant actions and achievements [2]. Their lives and behaviors not only reflect individual destinies, but also epitomize the socio-cultural and political changes within specific historical contexts. In essence, historical figures serve as catalysts for change, and their decisions, actions, and interactions directly influence social evolution and the trajectory of history. Consequently, the study of historical figures, particularly the evolution of their temporal and spatial statuses, is crucial for a deeper understanding of historical development and for unraveling the socio-cultural dynamics of various eras. Through comprehensive analysis of these figures, we can discern patterns in historical evolution across time and space, thereby providing valuable empirical references and insights for contemporary societal advancement.

Traditional research on historical figures has predominantly relied on static documents, biographies, and other archival sources that provide valuable historical information. However, when confronted with vast datasets on historical figures, manual analysis becomes extremely cumbersome and inefficient, particularly when the dataset encompasses millions of individuals. Under such conditions, traditional manual processing is often inadequate to satisfy the demand for precise analysis. With the rapid development of computer technology, the emergence of character databases has become imperative, as these databases facilitate the digital recording, centralized management, and systematic organization of information related to people [3]. Consequently, the challenge of effectively automating the processing of large-scale historical character data—and uncovering the inherent patterns and trends through modern big data and artificial intelligence techniques—has become a significant area of research.

Depending on the scope of the individuals included, character databases can be classified into two categories: single person databases and group person databases [4].

A single figure database is a systematic and comprehensive repository of information related to the life of a specific historical figure. The data encompass biographical details, familial relationships, academic achievements, manuscripts, honors, photographs, audio and video recordings, as well as research activities, commemorations, and other related information. These databases emphasize depth and richness in the portrayal of an individual, exemplified by resources such as the Chancellor’s Library [5].

In contrast, a group figure database focuses on the collective life experiences of historical personages within a particular temporal or spatial context, exploring the interrelationships and shared socio-cultural characteristics among them. Leveraging computer technology enables the efficient organization and management of extensive historical datasets. Representative group figure databases, both domestic and international, include the Clergy of the Church of England Database (CCEd) [6], the Prosopography of the Byzantine World [7], the Prosopography of Anglo-Saxon England (PASE) [8], In the First Person (FIRP) [9], the China Biographical Database Project (CBDB) [10], and other databases [11].

Most of the data sources of these databases are textual, and constructing a character database from textual information is a difficult task. Information Extraction (IE) is an important subfield of natural language processing, which usually involves extracting named entities, the relationships between them, and the events in which they are involved. Named Entity Recognition (NER) is one of the sub-tasks of IE, which classifies proper names in natural language texts into names of individuals, places, organizations, etc. [12]. For example, Chen employed data mining and entity disambiguation to identify named entities and their relationships to historical information described in narrative form [13]. However, existing named entity recognition techniques identify entities statically, so the resulting databases mostly organize the basic information of characters statically and do not construct topological relationships of spatio-temporal continuity, such as character migration, mutual communication and cooperation, or confrontation and conflict. To make up for the lack of attention to spatio-temporal attributes in previous database studies, we adopt TextNet, a textual spatio-temporal quantitative computational framework. The technique identifies and links spatio-temporal entities based on textual context and cuts the text into events at the granularity of spatio-temporal scenes. It aligns time and space, and further extracts temporal entities, spatial entities, and event texts. The spatio-temporal changes among these characters have an important impact on the historical process, but traditional databases often struggle to reflect these spatio-temporal dynamics and complexities.

The purpose of constructing a complete character database is to enable better temporal and spatial analyses of the characters. Previous scholars have tended to consider only specific groups of characters, such as youth [14] and couples [15]; factors affecting the migration of characters [16]; or specific directions of movement, such as rural to urban [17], urban to rural [18], and so on. Conventional approaches often render the analysis of character activities overly one-dimensional. This study overcomes such limitations by constructing a spatio-temporal chain dataset that encompasses all characters, thereby enriching the data dimensions and fostering a more comprehensive and systematic analytical perspective. However, directly analyzing the spatio-temporal chains of all individuals results in an excessive amount of data and complexity, making it difficult to identify key information. Therefore, this study transforms the character movement patterns from a singular chain structure into a graph structure and constructs a character movement network graph.

Although the graph structure is more concise and easier to analyze compared to the spatio-temporal chain structure, it still contains numerous insignificant edges, which hinders the extraction of the characters’ primary movement directions. Currently, many scholars have investigated methods to identify important nodes within networks [19,20,21,22] and to perform community detection [23,24,25,26] in order to pinpoint significant sections of the network. While these two approaches are capable of identifying key components to a certain extent, one focuses solely on assessing node importance, and the other concentrates on local significance rather than the global structure. Moreover, when these methods are applied to reduce networks, they often struggle to ensure that the reduced network maintains connectivity at a global level.

We have found that geometric renormalization techniques in complex networks can eliminate insignificant nodes while preserving the network’s core properties and multiscale self-similarity [27,28,29]. Notably, the classic random walk approach ensures that, even after reducing the network’s scale, its overall topological characteristics and functional modules remain intact [30]. Therefore, we employ the random walk method to extract high-fidelity subgraphs that retain the essential attributes and structure of the original graph, thereby facilitating the study of the primary movement directions in character activities.

When analyzing character activities, focusing solely on the trajectories of these activities is insufficient to reveal their intrinsic patterns; it is also necessary to consider the social contexts in which the characters operate. However, textual materials related to characters are typically lengthy and summarizing them manually is both time-consuming and labor-intensive. Therefore, the application of keyword extraction techniques to identify the most representative keywords from extensive texts not only provides a more accurate summary of the main ideas but also significantly enhances analytical efficiency.

Various methods exist for keyword extraction. For instance, simple statistical methods involve counting the occurrences of specific candidate words and ranking them based on their weights, with the top-ranked words being selected as keywords. The TF-IDF (Term Frequency–Inverse Document Frequency) algorithm is a typical representative of these simple statistical approaches. Wang Y. [31] incorporated the calculation of news headline weights into the TF-IDF algorithm based on the characteristics of online news texts to automatically extract keywords. The advantage of simple statistical methods lies in their ease of implementation and low computational cost, while their limitations are primarily characterized by insufficient scalability and lower accuracy. Graph-based keyword extraction algorithms construct a word graph from the text content, and keywords are determined by ranking the nodes within the graph. Guo et al. [32], focusing on Chinese texts, proposed a multi-feature fusion keyword extraction algorithm based on TextRank. The strength of graph-based keyword extraction models is their ability to capture the associations among words; however, despite improved accuracy, their overall performance remains suboptimal.

With the rapid development of machine learning and deep learning, new momentum has been injected into keyword extraction. The Bidirectional Encoder Representations from Transformers (BERT) model, as a pre-trained model, has achieved excellent results in keyword extraction tasks [33]. BERT has been applied to keyword extraction in scientific journals [34], product R&D documents [35], and sustainable development reports [36]. Consequently, this study adopts the ZhKeyBERT model, whose core principle is to obtain semantic embedding representations of both the document and candidate words through the BERT model, and to select the keywords that best represent the document content via cosine similarity calculations. Considering that our textual materials are entirely in Chinese, ZhKeyBERT significantly enhances extraction accuracy by reinforcing Chinese word segmentation mechanisms and semantic encoding capabilities.
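The selection principle described above (embed the document and the candidate words, then keep the candidates closest to the document by cosine similarity) can be illustrated with a minimal sketch. The vectors below are hypothetical toy embeddings standing in for BERT outputs, and `rank_keywords` is an illustrative helper, not part of ZhKeyBERT's interface.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_keywords(doc_vec: Sequence[float],
                  candidates: Dict[str, Sequence[float]],
                  top_k: int) -> List[str]:
    """Return the top_k candidate words most similar to the document vector."""
    ranked = sorted(candidates,
                    key=lambda word: cosine(doc_vec, candidates[word]),
                    reverse=True)
    return ranked[:top_k]


# Toy two-dimensional embeddings (assumed values, for illustration only).
doc = [1.0, 0.0]
words = {"revolution": [1.0, 0.1], "harvest": [0.0, 1.0], "school": [0.5, 0.5]}
top_two = rank_keywords(doc, words, top_k=2)
```

In a real pipeline the embeddings would come from the pre-trained model and the candidates from Chinese word segmentation; only the ranking step is sketched here.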

Against this background, this study introduces an innovative approach to constructing and mining character databases. Unlike traditional static databases that merely compile and organize textual materials, this study constructs a collection of spatio-temporal data chains of figures by applying Named Entity Recognition technology to automatically identify spatial entities, temporal entities, and event contents in the text. Leveraging these spatio-temporal chains, this research efficiently detects fluctuating trends and migratory directions in characters’ state changes, while also conducting an in-depth analysis of their impact on social progress within historical contexts. This study focuses on personages documented in local records of Fujian Province with the aim of realizing the automated construction and mining of a modern people database.

The main contributions of this paper are as follows:

(1): It transcends the limitations of traditional databases by utilizing the temporal and spatial attributes of persons to construct complex spatio-temporal chains.
(2): It constructs a matrix of character state changes to identify fluctuation patterns and reveal the underlying causes.
(3): It employs a random walk algorithm to determine the primary migration directions of characters, offering new insights into the dynamic migration patterns of historical figures.
(4): It utilizes the ZhKeyBERT model to extract the keywords from time windows corresponding to historical events, thereby facilitating a deeper analysis of the factors influencing character movements.

This paper is organized as follows: Section 2 describes the dataset and the method of constructing the spatio-temporal chain. Section 3 presents the results of the character spatio-temporal chain and the state matrix. Section 4 discusses the societal context underlying the migration directions of the characters. Finally, Section 5 summarizes the main findings of this study and outlines potential directions for future research.

2. Materials and Methods

Figure 1 presents the flowchart of this study. Our research is centered on the construction of a character database, along with the in-depth exploration and analysis of character states derived from this database.

2.1. Data Sources

Local chronicles, which serve as the primary data source of this study, are comprehensive historical records organized by administrative regions. These documents provide detailed accounts of various aspects of a specific area, characterized by their authenticity, clear historical evolution and extensive coverage [37]. As a unique and valuable component of our cultural heritage [38], local chronicles were selected as the textual material for this research. The records of personages primarily consist of written accounts that focus on narrating the life stories of notable historical figures, containing rich and detailed historical information. The content covered includes an individual’s name, place of origin, birth and death dates, associated figures, time and locations of acquaintance, personal relationships, historical events and their spatiotemporal contexts, official positions, affiliations, political achievements, literary and artistic works, educational contributions, medical achievements, scientific and technological innovations, and the subsequent impacts and evaluations of these accomplishments.

The data utilized in this study are derived from the local chronicles of 60 cities, districts and counties in Fujian Province, China. Through a series of processing steps, including digital scanning, Optical Character Recognition (OCR), and manual proofreading, the textual information has been digitized, thereby providing a robust data foundation for the construction of a spatial-temporal characters database. In total, fourteen thousand biographical records from Fujian Province were collected in this study.

2.2. Study Area

Fujian province is situated on the southeast coast of China, across the strait from Taiwan province, and its distinctive geographic setting—characterized by both mountains and sea—has profoundly shaped its unique humanistic identity. As one of the first provinces in modern China to open to international engagement, Fujian emerged as a commercial hub (e.g., Fuzhou and Xiamen) in the mid-19th century. Its early exposure to Western civilization laid the foundation for ideological clashes and significant social transformation. A multitude of figures from Fujian have played pivotal roles in China’s modernization across political, military, cultural, educational, and economic domains. These individuals have made indispensable contributions to modern Chinese history, whether by promoting ideological enlightenment, participating in revolutionary endeavors, spearheading industrial initiatives to revitalize the nation, or bridging Chinese and Western cultural paradigms.

2.3. Construction of Spatio-Temporal Chain

Although we recorded all textual data concerning the characters in the database, these texts are inconvenient to use directly for spatio-temporal analysis because they contain redundant information. This redundancy poses a challenge for constructing the people database. A character’s life experience can be conceptualized as a series of discrete spatial and temporal events, each defined by a time, location, and event content, arranged in chronological order. This approach transforms complex textual data into an analyzable format. Therefore, this paper uses TextNet [39,40,41], a computational framework for textual spatio-temporal quantization. This technique performs spatio-temporal entity recognition and linking based on textual context, segmenting texts into spatio-temporal events. Subsequently, temporal entities, spatial entities and event texts are extracted from these segments, as shown in Figure 2. We evaluated the recognition accuracy of TextNet on 1000 manually extracted samples. The results indicate that the recognition accuracy of temporal entities is 90.8%. Analyzing the misrecognition cases reveals that the primary cause lies in the variations in time recording methods across different historical periods. Although most temporal entities from different eras were correctly identified, a small number of errors still occurred. For example, “Republic of China Year 13 (1924)” was incorrectly recognized as “0013”. These errors typically fall outside our defined research time range, allowing us to mitigate most misrecognitions during the experiment. In the evaluation of spatial entity recognition, after removing 33 samples in which the text did not explicitly specify a location, the recognition accuracy was found to be 86.5%, slightly lower than that of temporal entities.
Further analysis indicates that the main reason for misrecognition is the ambiguity or omission of key information in textual descriptions when individuals move between locations, making accurate location identification more challenging. Therefore, in the subsequent processing of spatial entities, we preserve location status by selecting the location with the highest temporal resolution, thereby ensuring the completeness and consistency of spatial information.

A character spatio-temporal chain represents the sequence of a person’s migration or activity path across different spatio-temporal nodes. It creates a coherent, unilinear trajectory by treating each time point as a node and sequentially linking the character’s activities over time. Each node typically carries state information, such as geographic location and event details, thereby illustrating the evolution of characters over time and space.

Despite the capability of TextNet to automatically identify temporal, spatial, and event entities from a person’s life, issues persist in the resulting set of spatio-temporal incidents, largely due to inherent limitations in the original records. First, local chronicles record characters’ life experiences with varying temporal granularities, leading to potential overlaps or nested relationships between time periods corresponding to different event scenarios for the same character. Such discrepancies hinder the construction of coherent character spatio-temporal chains. Second, local historical records occasionally document multiple locations for a character within the same time period for brevity, resulting in a one-to-many relationship between a time period and spatial locations. Conceptually, the character’s trajectory should form a continuous, single chain; however, these issues can lead to overlapping and branching spatio-temporal chains.

To address these challenges, this study proposes a method for generating character behavior time series, along with a method for processing character spatial location data.

2.3.1. Character Behavior Time Series

From the definition of a character’s spatio-temporal chain, it follows that a character’s spatio-temporal trajectory should be linear, with no overlap or containment between time intervals. In addition, since local chronicles generally record only significant historical events in an individual’s life, certain periods inevitably remain unrecorded; these gaps should be distinctly labeled for proper differentiation.

In this study, all scene data pertaining to an individual are considered as a collection of data, with the interval between the start and end nodes of a given period representing a consistent state. Given that overlapping or nested relationships may exist among time periods, it is necessary to leverage these inter-period relationships to refine the segmentation of broad temporal intervals. The specific processing method is outlined as follows:

Suppose we have $n$ initial time periods $T_{1}, T_{2}, \dots, T_{n}$, where each $T_{i}$ denotes a specific time interval. Let $T_{i} = [T_{i1}, T_{i2}]$, where $T_{i1}$ and $T_{i2}$ represent the start time and end time of the period $T_{i}$. The initial set of time periods may exhibit overlapping, nested, or blank relationships.

Firstly, the time nodes (including start and end nodes) of all time periods are sorted in ascending order to obtain a new time node sequence:

$$T = \{t_{1}, t_{2}, \dots, t_{m}\}, \quad m = 2n - 1 \tag{1}$$

where $m$ denotes the total number of new time nodes. For each pair of consecutive time nodes $(t_{i}, t_{i+1})$, we define a new time period as follows:

$$T_{i} = [t_{i}, t_{i+1}], \quad i = 1, 2, \dots, m - 1 \tag{2}$$

In this manner, the original $n$ time periods are refined into $m - 1$ new time periods by combining adjacent time nodes.

Valid time slot: If the time node $t_{i}$ is in a valid time range (i.e., not a blank period), its state is determined by the state(s) of that period. In the event that multiple states $\{S_{1}, S_{2}, \dots, S_{k}\}$ are present within the period, the state with the smallest temporal resolution is selected:

$$S(t_{i}) = \min(S_{1}, S_{2}, \dots, S_{k}) \tag{3}$$

This node is then labeled as “node” to indicate a valid time slot.

Blank time slot: If the time node $t_{i}$ falls within a blank period, its state is inherited from the state $S(t_{i-1})$ of the previous valid time slot and is labeled as “unknown”:

$$S(t_{i}) = S(t_{i-1}) \tag{4}$$

By applying the above method, the original $n$ time periods $T_{1}, T_{2}, \dots, T_{n}$ are refined into $m - 1$ new time slots, denoted as:

$$T_{1} = [t_{1}, t_{2}], \; T_{2} = [t_{2}, t_{3}], \; \dots, \; T_{m-1} = [t_{m-1}, t_{m}] \tag{5}$$

These new time periods consist of time slots that include both valid data and blank intervals. Note that if two time periods share identical start and end times, the duplicate is redundant and should be removed before analysis.

As shown in Figure 3, by arranging all the time nodes in ascending order and forming new time slots from adjacent time nodes, we are able to refine the segmentation of the original time intervals. The state update rules applied to each new time node ensure an accurate representation and high-precision analysis of the spatio-temporal data.
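The refinement procedure of this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each record is a hypothetical `(start, end, state)` tuple with integer years, and "smallest temporal resolution" is approximated by choosing the shortest original period that covers the slot.

```python
from typing import List, Optional, Tuple

Period = Tuple[int, int, str]               # (start, end, state); assumed layout
Slot = Tuple[int, int, Optional[str], str]  # (start, end, state, label)


def refine_periods(periods: List[Period]) -> List[Slot]:
    """Split possibly overlapping periods at every start and end node.

    A slot covered by at least one original period is labeled "node";
    an uncovered gap is labeled "unknown" and inherits the previous state.
    """
    # Sort all distinct start and end nodes in ascending order.
    nodes = sorted({t for start, end, _ in periods for t in (start, end)})
    slots: List[Slot] = []
    previous_state: Optional[str] = None
    # Pair consecutive nodes into new time slots.
    for left, right in zip(nodes, nodes[1:]):
        covering = [p for p in periods if p[0] <= left and right <= p[1]]
        if covering:
            # Several states overlap: keep the one from the shortest period
            # (an assumed stand-in for the finest temporal resolution).
            _, _, state = min(covering, key=lambda p: p[1] - p[0])
            label = "node"
        else:
            # Blank slot: inherit the state of the preceding slot.
            state, label = previous_state, "unknown"
        slots.append((left, right, state, label))
        previous_state = state
    return slots


example = [(1900, 1910, "Fuzhou"), (1903, 1905, "Xiamen"), (1915, 1920, "Beijing")]
refined = refine_periods(example)
```

In this example the nested 1903 to 1905 period overrides the broader one for its duration, and the unrecorded 1910 to 1915 gap is kept as an "unknown" slot carrying the last known state.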

2.3.2. Processing of Character Spatial Location

As noted earlier, the spatial locations of characters in local chronicles are not recorded in a one-to-one correspondence with time, meaning that characters may appear in multiple spatial locations within the same time range. This phenomenon results in branching within the spatio-temporal chain of figures. To address this issue, we assume that the order in which spatial locations are recorded corresponds to the actual sequence of the characters’ movements; that is, the first recorded location is the first location visited by the character.

For time slots that encompass multiple spatial locations, we proceed as follows:

Valid time slot: For a non-blank period, suppose multiple spatial locations occur, denoted as $P = \{P_{1}, P_{2}, \dots, P_{m}\}$, where $m$ is the number of spatial locations. Given that there are several spatial locations within the time span, we partition the period evenly according to the number of spatial locations and the chronological order in which they appear.

Assuming that a character experiences $m$ spatial locations during the time period $T = [t_{a}, t_{b}]$, the averaged duration $\Delta T$ of each sub-period is computed as:

$$\Delta T = \frac{t_{b} - t_{a}}{m} \tag{6}$$

Accordingly, the period $T$ is divided into $m$ sub-time segments $T_{1}, T_{2}, \dots, T_{m}$, with each sub-time segment corresponding to a spatial location. Specifically, the time slot during which the character is at the spatial location $P_{1}$ is:

$$T_{1} = [t_{a}, t_{a} + \Delta T] \tag{7}$$

Similarly, the time slot for spatial location $P_{2}$ is:

$$T_{2} = [t_{a} + \Delta T, t_{a} + 2 \Delta T] \tag{8}$$

Continuing this reasoning, the character is at spatial location $P_{m}$ during the time slot:

$$T_{m} = [t_{a} + (m - 1) \Delta T, t_{b}] \tag{9}$$

Thus, by evenly dividing the time period $T = [t_{a}, t_{b}]$ into sub-periods, we refine the original interval into multiple segments, each assigned to the corresponding spatial location based on its chronological order.

Blank time slot: For a blank time span, if the inherited state comprises multiple spatial locations, the last spatial position in their order of appearance is selected as the positional state for that time slot. Specifically, assume that the time period $T_{null} = [t_{x}, t_{y}]$ represents a blank time span and the inherited spatial position sequence is $P = \{P_{1}, P_{2}, \dots, P_{m}\}$. Then, the position state for the blank time period is defined as

$$S(T_{null}) = P_{m} \tag{10}$$

This means that the state of the blank period is inherited from the last spatial position recorded.

By applying the above processing methods, we can address the issue of characters appearing in multiple spatial locations within the same time span. In this framework, the state of a blank time slot is inherited from the last recorded spatial position, whereas a non-blank time slot is evenly divided according to the number of spatial positions and their chronological order. This methodology ensures that the spatio-temporal chain of characters remains consistent and accurately reflects the actual movement of the characters.
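As a rough illustration of the two rules above, the sketch below divides a time slot evenly among its recorded locations and picks the last location for a blank slot. The function names and the `(start, end, location)` tuple layout are assumptions made for this example.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start, end, location); assumed layout


def split_evenly(t_a: float, t_b: float, places: List[str]) -> List[Segment]:
    """Divide [t_a, t_b] into len(places) equal sub-periods, in recorded order."""
    m = len(places)
    if m == 0:
        raise ValueError("at least one location is required")
    delta = (t_b - t_a) / m  # averaged duration of each sub-period
    segments: List[Segment] = []
    for k, place in enumerate(places):
        start = t_a + k * delta
        # The final segment ends exactly at t_b, which avoids rounding drift.
        end = t_b if k == m - 1 else t_a + (k + 1) * delta
        segments.append((start, end, place))
    return segments


def blank_slot_location(inherited_places: List[str]) -> str:
    """A blank slot keeps the last location of the inherited sequence."""
    return inherited_places[-1]


parts = split_evenly(1900, 1906, ["Fuzhou", "Xiamen", "Shanghai"])
resting_place = blank_slot_location(["Fuzhou", "Xiamen", "Shanghai"])
```

Here a six-year slot with three recorded locations yields three two-year segments, and a following blank slot would be assigned to the last of them.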

Through the processing of time and space described above, the spatio-temporal chains of the characters can be successfully derived, as shown in Figure 4.

2.4. Spatiotemporal State Changes of Characters

2.4.1. Character State Change Matrix

Social transformation is a continuous process, and people—being a critical dimension of social development—effectively reflect social change and historical trends through their evolving status. Given that this study focuses mainly on modern history, the temporal scope is defined from 1840 to 2009.

The persona state change matrix is constructed as follows:

First, each year within this study’s time interval is represented along the horizontal axis, denoted as $t_{j} \in \{1840, 1841, \dots, 2009\}$, encompassing a total of 170 years. The vertical axis represents each character, with the character dimension denoted as $p_{i} \in \{1, 2, \dots, N\}$, where $N$ is the total number of characters, and each character is assigned a unique identifier, as shown in Figure 5.

Each element $M_{ij}$ in the matrix denotes the number of historical events in which person $p_{i}$ participated in year $t_{j}$, thereby representing the frequency of the person’s activities during that year. Let $Event(p_{i}, t_{j})$ denote the set of all historical incidents involving person $p_{i}$ in year $t_{j}$; then the matrix element $M_{ij}$ is computed as:

$$M_{ij} = |Event(p_{i}, t_{j})| \tag{11}$$

where $|Event(p_{i}, t_{j})|$ indicates the number of historical events in which person $p_{i}$ participated in year $t_{j}$.

By constructing the character state change matrix $M$, we can analyze the activities of each character across different time periods, thereby revealing the trend of their activities in relation to broader social changes.
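The construction of the matrix can be sketched in a few lines. The example below assumes the events are available as hypothetical `(person_id, year)` pairs; years outside 1840 to 2009 (for instance a misrecognized year such as 13) are simply dropped.

```python
from collections import Counter
from typing import List, Tuple

START_YEAR, END_YEAR = 1840, 2009  # study period stated in the text


def build_state_matrix(events: List[Tuple[str, int]]) -> Tuple[List[str], List[List[int]]]:
    """Return (person_ids, M), where M[i][j] counts events of person i in year 1840 + j."""
    counts = Counter((pid, year) for pid, year in events
                     if START_YEAR <= year <= END_YEAR)
    person_ids = sorted({pid for pid, _ in events})
    n_years = END_YEAR - START_YEAR + 1  # 170 columns
    matrix = [[0] * n_years for _ in person_ids]
    row = {pid: i for i, pid in enumerate(person_ids)}
    for (pid, year), count in counts.items():
        matrix[row[pid]][year - START_YEAR] = count
    return person_ids, matrix


sample_events = [("p1", 1911), ("p1", 1911), ("p2", 1949), ("p2", 13)]
ids, M = build_state_matrix(sample_events)
# Column sums give the overall activity per year across all characters.
yearly_activity = [sum(col) for col in zip(*M)]
```

Summing the columns, as in the last line, is one simple way to read an overall trend of state changes from the matrix.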

Utilizing the character state change matrix constructed from the database, we describe the overall pattern of changes in character states. To further understand the underlying causes of these changes, we conduct an in-depth analysis from two perspectives. First, we examine the character’s spatio-temporal chain data to analyze their movement directions over time and space. Second, we analyze the social context by applying keyword extraction technology to extract principal keywords, thereby elucidating the deeper factors driving the movement of characters.

2.4.2. Main Direction of Character Movement

As mentioned in the introduction, previous studies on character migration have often focused on specific groups of individuals or particular migration directions. To overcome this limitation, a more comprehensive approach to analyzing character movement is needed. In this study, we utilize an independently constructed spatio-temporal chain dataset of historical figures, which significantly expands the volume of data and enhances the comprehensiveness of the analysis. However, if each character is examined separately—meaning all spatio-temporal chains are visualized on a single graph—the excessive data volume results in a cluttered and disorganized representation, making it challenging to extract meaningful information.

To address this issue, we transform the character movement representation from a linear chain structure to a graph-based structure, referred to as the character movement network or character migration network. In this network, locations serve as nodes, and different administrative levels (e.g., provincial, city, or county) can be selected based on the historical event under study. Within a given historical event’s time window, the number of migrations between two locations plays a crucial role in determining migration direction. Therefore, we define the weight of an edge in the graph as the number of movements between two locations, and these edges are represented as bidirectional to reflect the intensity of character movement.
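A minimal sketch of this chain-to-graph conversion is given below. It assumes each chain is a chronologically ordered list of hypothetical `(start_year, end_year, location)` slots, dates a move by the start of the destination slot, and stores the two directions of an edge as separate counts; these are illustrative choices rather than details taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

Chain = List[Tuple[int, int, str]]  # (start_year, end_year, location); assumed layout


def build_movement_network(chains: List[Chain],
                           window: Tuple[int, int]) -> Dict[Tuple[str, str], int]:
    """Count moves between consecutive, different locations inside a time window."""
    first_year, last_year = window
    weights: Counter = Counter()
    for chain in chains:
        for (_, _, src), (start, _, dst) in zip(chain, chain[1:]):
            # A move is dated by the start of the destination slot (an assumption).
            if src != dst and first_year <= start <= last_year:
                weights[(src, dst)] += 1
    return dict(weights)


sample_chains = [
    [(1900, 1905, "Fuzhou"), (1905, 1910, "Xiamen"), (1910, 1915, "Fuzhou")],
    [(1902, 1906, "Fuzhou"), (1906, 1912, "Xiamen")],
]
early = build_movement_network(sample_chains, (1900, 1908))
full = build_movement_network(sample_chains, (1900, 1915))
```

Adding the counts for `(a, b)` and `(b, a)` gives an undirected intensity if direction is not needed.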

While the graph-based structure provides a more concise and interpretable representation than individual spatio-temporal chains, it still contains numerous insignificant nodes and edges that obscure the primary migration patterns. To simplify the network while preserving its structural integrity and key attributes, we apply geometric renormalization techniques from complex network analysis to remove less significant nodes and edges. An effective strategy for this process is the random walk, which enables us to filter out inconsequential components while retaining the most critical structural elements. By implementing this method, we ensure that the most significant regions of the migration network remain intact, without compromising overall connectivity, thereby improving the clarity and interpretability of character movement patterns, as illustrated in Figure 6.

In our experiments, the edge weights play a crucial role in recognizing the direction of character migration. Iacopini [42] introduced an edge-enhanced random walk approach, where the network possesses a specified topology in which edge weights reflect the strength of associations between concepts. His model incorporates a reinforcement mechanism that evolves over time, such that each traversal of an edge increases its weight, making subsequent traversals more likely.

It is important to note that the updating of weights during the random walk is solely intended for computing transfer probabilities and is not involved in the final subgraph extraction process.

The network’s weight matrix at time t is denoted as W^{t} = \{w_{ij}^{t}\}, where w_{ij}^{t} quantifies the strength of the relationship between nodes i and j at time t. Based on the weight priority mechanism, if the random walker is at node i at time t, the probability of transitioning to a neighboring node j is given by:

P^{t}(i \to j) = \frac{w_{ij}^{t}}{\sum_{l \in N_{i}} w_{il}^{t}}

(12)

where N_{i} denotes the set of nodes adjacent to node i. During the random walk, the weights of the traversed edges are updated according to:

w_{ij}^{t+1} = w_{ij}^{t} + δ_{w}

(13)

where δ_{w} is the edge weight reinforcement parameter.
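To make Equations (12) and (13) concrete, the following sketch computes the transition probabilities for one node and then performs a single reinforced step. The nested-dictionary representation, the function names, and the toy weights are assumptions chosen for illustration; they are not taken from the original study.

```python
import random

def transition_probabilities(weights, i):
    """Eq. (12): probability of stepping from node i to each neighbour j."""
    total = sum(weights[i].values())
    return {j: w / total for j, w in weights[i].items()}

def reinforced_step(weights, i, delta_w, rng):
    """Choose the next node according to Eq. (12), then apply Eq. (13)."""
    neighbours = list(weights[i])
    j = rng.choices(neighbours, weights=[weights[i][n] for n in neighbours])[0]
    weights[i][j] += delta_w
    weights[j][i] += delta_w  # undirected edge: keep both directions in sync
    return j

# Toy symmetric network with made-up movement counts as edge weights.
weights = {
    "Fuzhou": {"Guangzhou": 3.0, "Shanghai": 1.0},
    "Guangzhou": {"Fuzhou": 3.0},
    "Shanghai": {"Fuzhou": 1.0},
}
probs = transition_probabilities(weights, "Fuzhou")
rng = random.Random(0)
nxt = reinforced_step(weights, "Guangzhou", 0.01, rng)
```

In this toy case the walker at Fuzhou moves to Guangzhou with probability 3/4 and to Shanghai with probability 1/4. After the step taken from Guangzhou, the Fuzhou–Guangzhou weight grows by δ_{w}, so the same edge becomes slightly more likely on later visits.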

The following algorithm provides an overview of the method for extracting the primary movement directions of characters using a random walk model. (1) Take an initial weighted network G_{0} with N_{0} nodes and M_{0} edges, where each edge has an associated weight. Set the edge weight reinforcement parameter δ_{w}, the number of time steps T for the random walk, and the desired number of nodes N_{s} in the target subgraph. (2) Select the top m nodes with the highest degree centrality in G_{0} as candidate starting points for the random walk, and store these nodes in the set X. (3) Finally, take the subgraph formed by the distinct nodes and edges traversed during the random walk as a refined subgraph of the original network.

In our experiments, we set the number of candidate starting nodes to 10 in accordance with [30], meaning that the top ten nodes with the highest degree centrality in the initial network are selected for the random walk. The edge weight reinforcement parameter was set to δ_{w} = 0.01.
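The extraction procedure can be sketched end to end as follows. The text does not specify how the candidate starting nodes are used or when the walk stops, so this sketch assumes that the walker runs for at most T steps from each starting node in turn and halts once N_{s} distinct nodes have been visited. The function name, the toy network, and these stopping rules are illustrative assumptions only.

```python
import copy
import random

def extract_subgraph(weights, m, T, n_s, delta_w, seed=0):
    """Collect the nodes and edges visited by an edge-reinforced random walk.

    weights: symmetric nested dict {node: {neighbour: weight}}.
    Assumed rules (not stated in the paper): walk up to T steps from each
    of the top-m degree nodes, stopping once n_s distinct nodes are visited.
    """
    rng = random.Random(seed)
    w = copy.deepcopy(weights)  # reinforcement affects only the walk itself
    # Degree is taken here as the number of neighbours of a node.
    starts = sorted(w, key=lambda n: len(w[n]), reverse=True)[:m]
    nodes, edges = set(), set()
    for start in starts:
        if len(nodes) >= n_s:
            break
        current = start
        nodes.add(current)
        for _ in range(T):
            if len(nodes) >= n_s:
                break
            nbrs = list(w[current])
            if not nbrs:
                break
            nxt = rng.choices(nbrs, weights=[w[current][n] for n in nbrs])[0]
            w[current][nxt] += delta_w  # Eq. (13)
            w[nxt][current] += delta_w
            nodes.add(nxt)
            edges.add(frozenset((current, nxt)))
            current = nxt
    return nodes, edges

# Toy provincial network with made-up weights.
toy = {
    "Fujian": {"Guangdong": 5.0, "Shanghai": 3.0, "Beijing": 2.0, "Jiangxi": 1.0},
    "Guangdong": {"Fujian": 5.0, "Shanghai": 1.0},
    "Shanghai": {"Fujian": 3.0, "Guangdong": 1.0},
    "Beijing": {"Fujian": 2.0},
    "Jiangxi": {"Fujian": 1.0},
}
nodes, edges = extract_subgraph(toy, m=2, T=50, n_s=3, delta_w=0.01)
```

The walk operates on a copy of the weights, so the reinforced values influence only the transition probabilities and the returned subgraph can still be annotated with the original movement counts.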

2.4.3. Keyword Extraction

KeyBERT is an unsupervised keyword extraction technique that obtains semantic embedding representations of both the document and its candidate words via the BERT model, subsequently filtering the keywords that best represent the document’s content using cosine similarity. Specifically, the method encodes the entire document as a high-dimensional vector using BERT, while also generating vector representations for individual words or phrases in the document. The cosine similarity between the candidate word vectors and the document vectors is then calculated, and the candidate with the highest similarity is selected as the keyword. Designed for efficiency and ease of use, KeyBERT is particularly well-suited for the rapid processing of large-scale text.

From a technical standpoint, KeyBERT operates in three stages. First, document-level embedding vectors and candidate word embedding vectors are generated based on the BERT model. The quality of the document embedding directly influences the semantic representation of the extracted keywords. Second, candidate words are extracted using an n-gram strategy or customized rules to ensure that the core vocabulary is covered. Finally, the cosine similarity between the document vector D and each candidate word vector C_{i} is computed using the following formula:

\mathrm{Similarity}(D, C_{i}) = \frac{D \cdot C_{i}}{\| D \| \, \| C_{i} \|}

(14)

The semantic matching process outputs keywords ranked according to their similarity score.
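The ranking step based on Equation (14) can be illustrated with a short sketch. The two-dimensional vectors below are made-up stand-ins for BERT embeddings, and the function names are hypothetical; the example shows only the similarity computation and the ordering of candidates, not the embedding stage.

```python
import math

def cosine_similarity(d, c):
    """Eq. (14): cosine of the angle between document and candidate vectors."""
    dot = sum(x * y for x, y in zip(d, c))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in c))
    return dot / norm

def rank_candidates(doc_vec, candidates):
    """Return (word, score) pairs sorted from most to least similar."""
    scored = [(word, cosine_similarity(doc_vec, vec))
              for word, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors; real embeddings would come from a BERT-style encoder.
doc_vec = [1.0, 0.0]
candidates = {
    "government": [1.0, 1.0],
    "weather": [0.0, 3.0],
    "uprising": [2.0, 0.0],
}
ranked = rank_candidates(doc_vec, candidates)
```

Because cosine similarity ignores vector length, the candidate that points in the same direction as the document vector ranks first regardless of its magnitude.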

While KeyBERT has proven effective for keyword extraction in English text, its direct application to Chinese text often leads to challenges due to structural differences between the two language systems, such as semantic characterization biases and issues with word segmentation. To address these challenges, this study employs the ZhKeyBERT model, an optimized variant of KeyBERT. By strengthening the Chinese word segmentation mechanism and semantic encoding capabilities, ZhKeyBERT significantly improves extraction accuracy and effectively mitigates the semantic distortions associated with cross-language migration. Here, we restrict the extraction to single words.

3. Results

Applying the methods described above, we obtained the following results.

3.1. Automatic Construction and Querying of Spatio-Temporal Chains

By converting textual materials into character spatio-temporal chain data, we can automatically reconstruct the movement trajectories of historical figures. Specifically, using data from Fujian Province, we constructed 107,754 spatio-temporal chains of 5371 individuals and developed an open web platform that enables users to query character biographical data alongside their spatio-temporal chains (see Figure 7). In the character query interface, the character’s name is prominently displayed in the upper left corner, with the biography presented immediately below. Further down, detailed event data are provided, including the start and end times of events as well as the locations where these events occurred. Additionally, the upper right section of the interface features an automatically generated route map illustrating the character’s spatial and temporal trajectory, with relevant places clearly labeled. Users can zoom in and out using their mouse, with different scales revealing varying levels of locational detail. By zooming in, users are able to pinpoint precise spatial positions, thereby obtaining a clear depiction of the trajectories of individual activities and the distribution of events.

3.2. Results of Character State Changes

Through the character state research methodology described in the previous section, we constructed a character state change matrix. By aggregating the data for each year from the matrix, we derived the total level of character activity per year. Based on these data, we generated graphs to visualize the temporal trend in characters’ activities (see Figure 8). The graph illustrates both the intensity of the character activities and the fluctuations in their status over different periods, thereby providing a quantitative basis for understanding the changes in the frequency of characters’ participation in historical events.
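The yearly aggregation can be sketched as a column sum. The layout assumed here, with one row per character, one column per year, and an entry of 1 when that character’s state changed in that year, is an illustrative assumption, as are the function name and the sample values.

```python
def yearly_activity(matrix, years):
    """Sum each year's column of a character-by-year state change matrix.

    matrix: list of rows, one per character; matrix[p][k] is assumed to be 1
            if character p changed state in years[k], otherwise 0.
    """
    return {year: sum(row[k] for row in matrix) for k, year in enumerate(years)}

# Made-up example with three characters and four years.
years = [1910, 1911, 1912, 1913]
matrix = [
    [0, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
]
activity = yearly_activity(matrix, years)
peak_year = max(activity, key=activity.get)
```

Plotting `activity` against the year gives a curve of the kind used to identify peaks and stable intervals.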

The results reveal that the state fluctuations of modern figures in Fujian Province follow an initial upward trend followed by a decline, with strong overall volatility and notable peaks during specific periods, particularly 1910–1913, 1924–1952, and 1977–1980.

During 1910–1913, character activity fluctuations were especially pronounced, displaying an overall upward trend that peaked in 1913, which reflects a significant surge in activity during that time. Although fluctuations decreased slightly thereafter, the level of activity remained high. The period from 1924 to 1952 is marked by substantial volatility, frequent state changes, and large fluctuations, indicating highly dynamic and unstable character activity during this interval. Between 1950 and 1958, character activity initially increased before experiencing a slight decline; notably, in 1956–1957, activity levels peaked and then dipped, reflecting short-term fluctuations in intensity. The period from 1977 to 1980 saw another increase in fluctuations, with a slight rise in activity frequency in 1979 and 1980, leading to a short-term peak of moderate fluctuations. Thereafter, the fluctuations gradually stabilized, with no significant new peaks and an overall tendency toward decline.

Furthermore, the analysis shows two long periods of stability in character state fluctuations, from 1840 to 1910 and from 1960 to 2009. In both intervals, only minor fluctuations are observed, and the overall state changes tend to stabilize.

4. Discussion

Next, we proceed to conduct an in-depth analysis of the character movement direction and the underlying causes based on the designated time window.

4.1. Verification of Random Walk

In Section 2.4.2, we described how to employ the random walk method for network reduction while preserving the network’s key characteristics and connectivity. To assess the validity of this approach for our character migration network, we followed the methodology described in [43] and verified, using three specific metrics, whether the resulting subgraphs retain the primary structure of the original network and maintain network connectivity. The results are presented in Figure 9.

These three metrics pertain to the weighted properties of nodes and edges in a network. By “rescaling” the original node strengths, unit strengths, and edge weights, we eliminate the effects of network size and reveal their global distributional characteristics through complementary cumulative distributions. The complementary cumulative distributions p_{c}(s_{res}^{(l)}), p_{c}(U_{res}^{(l)}), and p_{c}(w_{res}^{(l)}) for different network layers exhibit a self-similarity feature.

These findings indicate that the random walk approach employed for network approximation successfully preserves the key properties of real weighted networks. Consequently, it can be effectively used to remove insignificant nodes and edges from our character migration network while ensuring that network connectivity remains intact.
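The comparison of distributions across layers can be illustrated with a short sketch. The text does not state the rescaling rule, so this example assumes that each quantity is divided by its mean within the layer, a common choice in renormalization studies; the function names and the sample strengths are hypothetical.

```python
def rescale(values):
    """Divide each value by the layer mean (assumed rescaling rule)."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]

def ccdf(values):
    """Return (x, P(X >= x)) for each distinct x, in ascending order."""
    n = len(values)
    return [(x, sum(1 for v in values if v >= x) / n)
            for x in sorted(set(values))]

# Made-up node strengths for one network layer.
strengths = [1.0, 1.0, 2.0, 4.0]
curve = ccdf(rescale(strengths))
```

Computing `curve` for the original network and for each reduced subgraph, and overlaying the results, is one way to check whether the rescaled distributions collapse onto a common shape.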

4.2. In-Depth Analysis of Character State Changes

Following the analysis above, our findings confirm that the random walk method can be effectively employed to simplify the network diagram of character movement, thereby revealing both the primary and secondary migration directions. As discussed in Section 3.2, the state fluctuations of modern figures in Fujian Province exhibit several pronounced peaks during specific historical periods while remaining relatively stable during other intervals. In the subsequent sections, we begin by briefly examining the stable periods, followed by an analysis of periods marked by sharp state fluctuations based on important historical time windows.

4.2.1. Stable Period

The stable periods can be categorized into two distinct time intervals: 1840–1910 and 1960–2009. In the following sections, we analyze each period individually.

(1): 1840–1910

During this period, the character state of Fujian is relatively stable, with a slow upward trend, as shown in Figure 10. Population mobility in Fujian was primarily restricted to neighboring coastal provinces and islands, including Zhejiang, Guangdong, Guangxi, Taiwan, and Hong Kong, with no significant inland or long-distance migration. The migration pattern was characterized by a “coastal corridor”, demonstrating weak interaction with China’s inland provinces.

Fujian’s topography, which is generally higher in the northwest and lower in the southeast, is dominated by mountainous and hilly terrain with limited plains [44]. The Wuyi Mountains acted as a natural barrier, restricting connections with inland provinces such as Jiangxi. Additionally, the overall transportation infrastructure remained underdeveloped, compelling population movements to follow waterways toward the coast. The population distribution in Fujian also exhibited a pattern of high density along the coast and sparse settlement in the inland areas [45]. Notably, the opening of Fuzhou and Xiamen as commercial ports following the Treaty of Nanjing in 1842 further reinforced the coastal shipping network [46], facilitating maritime trade and migration.

(2): 1960–2009

During this period, the overall state change of historical figures in Fujian remained relatively stable. However, the implementation of the reform and opening-up policy in 1978 led to a slight increase in state fluctuations. During this time, three primary directions emerged for figures from Fujian: (1) Guangdong, (2) Beijing and Tianjin, and (3) Jiangsu, Anhui, and Shanghai. With the introduction of the reform and opening-up policy, Guangdong became the first region in China to establish special economic zones, with Shenzhen’s founding marking the rapid economic rise of the province (Figure 11). The explosive growth of manufacturing industries in the Pearl River Delta (PRD) created a massive demand for labor, attracting large numbers of migrants from Fujian seeking better employment opportunities in Guangdong and Hong Kong. Additionally, as the capital of China, Beijing exerted great appeal in politics, culture, education, and scientific research. The resumption of the college entrance examination in 1977 further intensified migration, as a large influx of students from Fujian sought higher education opportunities in Beijing’s universities. Meanwhile, as an economic, financial, and cultural hub, Shanghai experienced unprecedented development following the reform and opening-up period. The rapid acceleration of industrialization and modernization in Shanghai attracted a significant labor force from Fujian, further reinforcing migration patterns toward China’s eastern coastal cities [47].

4.2.2. Volatile Period

The migration patterns of historical characters can be more effectively analyzed within the broader social and historical context of their time. Therefore, in the following sections, we choose the time windows of several key historical events for a more detailed analysis. Because the migration intensity between Fujian and Guangzhou during 1910–1913 was significantly higher than that with other regions, we set the administrative level of network nodes to the city level to more precisely investigate the primary migration directions. In contrast, for the other three time windows (1924–1927, 1937–1945, and 1945–1949), which reflect large-scale national migration patterns, we set the administrative level of network nodes to the provincial level.

(1): 1910–1913

As shown in Figure 12, the most frequent movement of historical figures occurs between Fuzhou and Guangzhou. In addition, there are notable movements between Fuzhou and other cities within Fujian Province, as well as with Wuhan, Beijing, and Shanghai. Beyond the migration directions, the analysis of characteristic keywords provides deeper insights into the social context of this period. Notably, terms such as ‘uprising’ and ‘government’ appear frequently, as illustrated in Table 1.

The Guangzhou Uprising (April 1911), also known as the Huanghuagang Uprising [48], and the Wuchang Uprising (10 October 1911) were two pivotal events during the Xinhai Revolution, a nationwide movement aimed at overthrowing the autocratic imperial system of the Qing Dynasty and establishing a republican form of government [49]. This revolution exerted a profound impact on society, as the resultant uprisings, armed conflicts, and political upheavals across various regions spurred significant population movements. Revolutionaries in Fujian played an active role in the Chinese Revolution by supporting these uprisings and disseminating revolutionary ideas to other regions [50].

During the Xinhai Revolution, movements between Fujian and Guangzhou became particularly pronounced. Guangzhou, as a major center of revolutionary activity, served not only as the epicenter of revolutionary operations but also as a hub for Fujianese revolutionaries. Many individuals from Fujian traveled between Guangzhou and their home province to participate in revolutionary activities. Notably, during the Huanghuagang Uprising, Fujian revolutionary forces such as the Alliance Association were instrumental, maintaining close cooperation with other revolutionary forces in Guangzhou and collectively advancing the revolution’s objectives [51].

Following the Xinhai Revolution, the gradual consolidation of power by the Northern Warlords had a significant impact on Fujian [52]. These warlords exerted control by appointing local officials and military commanders, leading to substantial changes in the local power structure. Consequently, many local figures experienced rapid rises or declines in influence based on their affiliations with the Northern Warlords [53].

(2): 1924–1927

During this period, Fujian Province significantly expanded its connections with other provinces, leading to an increase in the movement of figures between these regions. In addition to Guangdong, both Shanghai and Beijing engaged more frequently with Fujian during this period. A portion of the migration from Fujian was directed northward toward Beijing, Hebei, and Tianjin, while another portion moved to Zhejiang, Anhui, and Shanghai. The remaining flow was toward central China, predominantly to Jiangxi and Hubei, as illustrated in Figure 13.

As indicated in Table 2, the term ‘uprising’ continues to be frequently mentioned during this stage. The Nanchang Uprising, an armed rebellion that took place on 1 August 1927 in Nanchang, Jiangxi Province, was led by the Communist Party of China and involved forces from the National Revolutionary Army (NRA) [54]. The presence of guerrilla zones and units underscores the operational characteristics of the Communist Party. The NRA, established in June 1924 as an adaptation of the National Government’s army, played a central role during this period. From January 1924 to July 1927, the First Domestic Revolutionary War was waged under the joint leadership of the Communist Party of China (CPC) and the Kuomintang (KMT) against imperialist forces and the Northern Warlords. In July 1926, the National Revolutionary Army embarked on the Northern Expedition, and on 9 December 1926, it successfully entered Fuzhou [55,56]. With the progression of the first Nationalist–Communist cooperation, Fujian’s strategic position was gradually reinforced, establishing it as an important center for revolutionary activities and population movements.

During this period, movement from Fujian Province to the northern regions—namely Beijing, Hebei, and Tianjin—was particularly pronounced, reflecting the nationwide expansion of revolutionary activities. As the alliance between the Kuomintang and the Communist Party progressed, revolutionary forces in Fujian gradually strengthened their connections with these northern regions, which emerged as key centers of revolutionary activity, with Beijing, in particular, attracting numerous revolutionaries and supporters due to its political and cultural significance [57].

In addition, Jiangsu and Zhejiang, serving as the economic, cultural, and political centers of the time, attracted a substantial number of revolutionaries and progressive individuals from Fujian. Notably, Shanghai functioned not only as an important base for KMT–Communist cooperation but also as a center of revolutionary activities, owing to its geographical location and political importance. For example, on 22 May 1925, Fang Erhao accompanied Ma Nianyi to Shanghai to report on the situation to the Central Committee of the Communist Youth League, and, two months later, he returned to Fuzhou to further develop the organization [58].

Furthermore, migration from Fujian was not limited solely to the northward and eastward movements; significant flows were also directed toward Guangdong, Jiangxi, and Hubei. Guangdong, recognized as the birthplace of the National Government during this period, attracted numerous revolutionary forces, and the movement between Fujian and Guangdong underscored the close political and revolutionary cooperation between these regions. In contrast, Jiangxi and Hubei evolved into critical revolutionary bases, with Fujian’s growing ties to these two provinces reflecting the broader mobilization of revolutionary forces and the increasing need for support. For example, after the “4.12 Incident” in 1927, Chen Ming traveled to Wuhan to report on the situation in Fujian to the Central Committee of the Communist Party of China. In August, serving as the Party Affairs Special Commissioner for Fujian on behalf of the Central Committee, he returned to Xiamen, where he covertly established a liaison station, reconnected with dispersed party members, organized peasant uprisings, and formed peasant armed forces [59].

(3): 1937–1945

During the 1937–1945 War of Resistance against Japan, the predominant trend in population mobility exhibited a migration pattern centered on Fujian (see Figure 14). As a strategically significant region, Fujian not only served as a refuge for numerous displaced individuals and resistance forces from across the country but also acted as a primary point of departure for migration. From Fujian, population flow primarily moved northward, sequentially passing through Zhejiang and Anhui, and subsequently reaching Jiangsu, Shanghai, Shandong, and Beijing [60].

As indicated in Table 3, the dominant keywords during this period are “army” and “military region”, reflecting a society engulfed in warfare. The primary forces of confrontation were the Chinese People’s Liberation Army and the field army. A significant number of figures from Fujian joined the East China Field Army—a formation led by the Communist Party of China (CPC) that operated in East China, covering regions such as Shandong, Jiangsu, Henan, and most of Anhui [61]. The East China Field Army primarily originated from the National Revolutionary Army’s Newly Organized Fourth Army. Among them, soldiers from Fujian totaled nearly 5000, accounting for almost half of the total forces when the New Fourth Army was initially formed [62]. Specifically, the key areas of activity for Fujian characters were the Beijing Military Region, the Fuzhou Military Region, and the Guangxi Military Region.

Overall, Fujian, as the central hub of mobility, exhibited a trend of multi-directional expansion, including northward, southward, and westward movements, and actively engaged in anti-Japanese guerrilla warfare, reflecting its critical strategic position during the War of Resistance [63,64]. In 1937, the Chinese People’s War of Resistance against Japanese Aggression erupted in full force, rendering Fujian one of the key frontlines. The Japanese army’s repeated attacks on Fujian led to the mass displacement of its inhabitants. For instance, on 10 May 1938, following the Japanese occupation of Xiamen, the city’s population plummeted from approximately 180,000 to just 13,000 as a large number of citizens fled [65]. Subsequent assaults on Fuzhou and other locations further forced many Fujianese to relocate to the mainland or to other provinces in order to escape the conflict.

(4): 1945–1949

During the period 1945–1949, the pattern of personnel mobility exhibited characteristics distinct from those observed during the War of Resistance against Japan, particularly in the context of a shifting political and military landscape. During the Liberation War, although Fujian remained a major hub for population mobility, both the direction and volume of mobility underwent notable changes. As illustrated in Figure 15, Fujian continued to serve as the core region of mobility, giving rise to migration flows toward the north, south, and east.

In 1945, following the victory in the Chinese People’s War of Resistance Against Japanese Aggression, the civil conflict between the Kuomintang and the Communist Party resumed. The period from 1945 to 1949, known as the War of Liberation of the Chinese People, saw Fujian emerge as one of the most contested regions between these factions, and it became a primary area of activity for Fujianese figures, as indicated in Table 4. During this period, the Chinese Democratic League and the Central Army played crucial roles. From 21 to 30 September 1949, the First Plenary Session of the Chinese People’s Political Consultative Conference (CPPCC) was convened, successfully fulfilling the historical mission of establishing a new China and ushering in a new era in Chinese history [66].

Notably, the movement of individuals between Fujian and Shanghai, as well as between Fujian and Beijing, was more pronounced, reflecting a movement toward the major political and economic centers of that time. This trend was closely linked to the evolving political dynamics of the Liberation War, especially as the Communist Party of China gradually consolidated its control over northern and eastern China, thereby contributing to the formation of a new political landscape. In the early years of the People’s Republic of China, the Yangtze River Column’s southbound cadres played a crucial role in the takeover of Fujian, participating in the administration of six prefectural regions [67].

Moreover, mobility between Fujian and Guangdong, Hong Kong, and Taiwan remained active during this period, although the nature of mobility underwent some changes. The increased migration from Fujian to Taiwan may be attributed to shifts in cross-strait political relations in the post-war period.

5. Conclusions

In response to the limitations in the construction and analysis of historical figure databases, this study proposes an innovative method for the automated construction of character databases and the mining of spatio-temporal associations. By constructing a collection of spatio-temporal chains for individuals, the dynamic representation of temporal attributes, spatial attributes, and event associations is achieved, thereby transcending the static data organization models of traditional databases. An empirical study based on Fujian local chronicles demonstrates that, first, the proposed method effectively identifies spatio-temporal entities within texts and generates character spatio-temporal chains, providing a structured data foundation for analyzing interactions between historical figures and events. Second, by constructing a character state change matrix, the overall trends and fluctuations in the activities of modern historical figures in Fujian Province are revealed. The results indicate that the state of historical figures in Fujian generally follows an upward trend followed by a decline, with two stable periods (1840–1910 and 1960–2009) and four distinct fluctuation periods, namely 1910–1913, 1924–1927, 1937–1945, and 1945–1949. Notably, these fluctuation periods coincide with major historical events: the Xinhai Revolution, the First Chinese Revolutionary Civil War, the Chinese People’s War of Resistance Against Japanese Aggression, and the Liberation War.

Subsequently, the primary migration directions of historical figures within the time windows of these events are analyzed in detail, and keyword extraction techniques, combined with the contemporaneous social context, are employed to examine the underlying causes of the observed changes in character states. The experimental results reveal that modern historical figures exhibit both stable periods and periods of intense fluctuation. During the stable periods from 1840 to 1910 and from 1960 to 2009, the population of Fujian primarily migrated along a “coastal corridor” toward neighboring coastal provinces; however, following the reform and opening-up, population flows shifted toward Guangdong, the Beijing–Tianjin region, and the Yangtze River Delta. Between 1910 and 1913, the frequent exchanges between Fujian and Guangzhou reflect the impact of the Xinhai Revolution. For example, among the 72 martyrs of Huanghuagang, 19 were from Fujian, including figures such as Lin Juemin, Fang Shengdong, Feng Chaoxiang, and Luo Nailin. Between 1924 and 1927, strengthened ties between Fujian, northern regions, and Shanghai signified the nationwide expansion of revolutionary activities. During the period 1937–1945, Fujian emerged as a critical hub of population movement, with a predominant migration toward the north. Notably, in the spring of 1938, based on an agreement reached during negotiations between the Kuomintang and the Communist Party, Red Army guerrilla units in 14 guerrilla zones (excluding Qiongyao) across eight southern provinces were reorganized into the New Fourth Army of the National Revolutionary Army and moved northward to resist Japanese forces. Between 1945 and 1949, during the Liberation War, migration directions in Fujian became even more diverse, reflecting a new political landscape and social turbulence. Some military and political cadres of Fujian, such as Li Liangrong and Dai Zhongyu, opted to retreat to Taiwan, serving as typical cases of bifurcation between the Kuomintang and the Communist Party.

This study adopts a data-driven approach, utilizing government local chronicles to construct a modern historical figure database. This enables the quantitative analysis of migration patterns of historical figures across temporal, spatial, and social dimensions, thereby providing robust quantitative support for social science research. By comparing the time windows of major historical events, this study validates the profound impact of these events on population movement and social structure. Furthermore, the multidimensional cross-analysis tool constructed herein not only compensates for the limitations of singular data sources but also enriches the interpretation of historical contexts. More importantly, the proposed method transforms unstructured historical texts into standardized spatio-temporal data chains. Regardless of the historical period or region, as long as there exist textual data describing individuals, events, times, and locations, the same database construction methodology can be applied to achieve standardized data and unified management. By employing a unified temporal axis converter and a fuzzy matching algorithm, discrepancies arising from different chronological systems and variations between ancient and modern place names are resolved, thus enabling cross-temporal and cross-regional data integration and comparison, and providing a solid data foundation for revealing the migration patterns of historical figures and social transformations under varying contexts.

Although this study introduces a novel methodological framework for the spatio-temporal analysis of historical figures within the digital humanities, several limitations remain. Currently, our data sources primarily rely on local chronicles. Although these chronicles contain extensive information on historical figures and support the construction of a historical figure database, future work can enhance the completeness of spatio-temporal chains and the precision of spatio-temporal analyses by integrating heterogeneous textual sources such as biographies, newspaper and periodical literature, and publicly available online character data. This integration will facilitate the development of cross-media data verification and spatio-temporal correlation models. Furthermore, building upon our work in Fujian, we plan to conduct comparative analyses with other regions to explore the commonalities and differences in the migration patterns of historical figures across various geographical units. Additionally, we intend to investigate the integration of a transnational historical figures database and gradually establish an analytical framework that spans from the national to the global scale. By comparing the spatial and temporal characteristics of human movement across civilizations, we seek to deepen our theoretical understanding of the principles of human geography in the context of globalization. We aim to explore these frontiers in our future research.

Author Contributions

Conceptualization, X.J. and W.Y. (Wen Yuan); methodology, X.J. and W.Y. (Wen Yuan); software, X.J. and W.Y. (Wu Yuan); validation, X.J.; formal analysis, X.J.; investigation, X.J. and R.W.; resources, X.J. and X.G.; data curation, X.J. and X.G.; writing—original draft preparation, X.J.; writing—review and editing, X.J. and W.Y. (Wen Yuan); visualization, X.J. and R.W.; supervision, W.Y. (Wu Yuan); project administration, W.Y. (Wen Yuan); funding acquisition, W.Y. (Wen Yuan). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2022YFF0711601) and the Strategic Pilot Research Program of the Chinese Academy of Sciences (XDA23100103).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CCEd	Clergy of the Church of England Database
PASE	The Prosopography of Anglo-Saxon England
FIRP	In the First Person
CBDB	China Biographical Database Project
OCR	Optical Character Recognition
PRD	the Pearl River Delta
CPC	the Communist Party of China
KMT	Kuomintang of China
NRA	the National Revolutionary Army
CPLA	Chinese People’s Liberation Army
CPPCC	the Chinese People’s Political Consultative Conference

Figure 1. Flowchart of this study.

Figure 2. Figure of TextNet.

Figure 3. Flowchart of character behavior time series.

Figure 4. Diagram of the character spatiotemporal chain.

Figure 5. Character state change matrix.

Figure 6. This figure depicts the network after the removal of unimportant edges and nodes; each node represents a location, and the weight of each edge corresponds to the strength of the character movement between those locations. (a) original character movement network; (b) character movement network reduced to 40 target nodes; (c) character movement network further reduced to 10 target nodes.

Figure 7. This is the page displayed when searching for the name ‘范志辉 (Fan Zhihui)’. Note that our website is in Chinese.

Figure 8. Trends in character status from 1840 to 2009.

Figure 9. Results for three metrics. (a) Complementary cumulative distribution $p_c(s_{res}^{(l)})$ of rescaled node strength $s_{res}^{(l)} = s^{(l)} / \mathrm{mean}(s^{(l)})$. (b) Complementary cumulative distribution $p_c(U_{res}^{(l)})$ of rescaled unit strength $U_{res}^{(l)} = U^{(l)} / \mathrm{mean}(U^{(l)})$. (c) Complementary cumulative distribution $p_c(w_{res}^{(l)})$ of rescaled edge weight $w_{res}^{(l)} = w^{(l)} / \mathrm{mean}(w^{(l)})$.
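The rescaled quantities in the Figure 9 caption can be made concrete with a small sketch. It assumes an undirected reading of the movement network, takes node strength as the sum of incident edge weights, and uses the convention P(X ≥ x) for the complementary cumulative distribution; the edge list and function names are hypothetical, and unit strength is omitted because its definition is not restated here.

```python
from collections import defaultdict
from statistics import mean


def node_strengths(edges):
    """Sum the weights of all edges incident to each node (undirected)."""
    strength = defaultdict(float)
    for u, v, w in edges:
        strength[u] += w
        strength[v] += w
    return dict(strength)


def rescale(values):
    """Divide every value by the sample mean, so the result has mean 1."""
    m = mean(values)
    return [v / m for v in values]


def ccdf(values):
    """Return (x, P(X >= x)) pairs for each distinct x, in ascending order."""
    n = len(values)
    ordered = sorted(values)
    points = []
    for i, x in enumerate(ordered):
        # Record only the first occurrence of each distinct value.
        if i == 0 or x != ordered[i - 1]:
            points.append((x, (n - i) / n))
    return points


# Hypothetical movement network: (place A, place B, number of moves).
edges = [
    ("Fuzhou", "Xiamen", 4.0),
    ("Fuzhou", "Beijing", 2.0),
    ("Xiamen", "Quanzhou", 2.0),
]

s_res = rescale(list(node_strengths(edges).values()))
w_res = rescale([w for _, _, w in edges])
print(ccdf(s_res))
print(ccdf(w_res))
```

Because each quantity is divided by its own mean, curves computed at different reduction levels can be compared on a common axis, which is how the panels in the figure are read.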

Figure 10. Main directions of movement of historical figures between 1840 and 1910.

Figure 11. Main directions of movement of historical figures between 1960 and 2009.

Figure 12. Main directions of movement of historical figures between 1910 and 1913.

Figure 13. Main directions of movement of historical figures between 1924 and 1927.

Figure 14. Main directions of movement of historical figures between 1937 and 1945.

Figure 15. Main directions of movement of historical figures between 1945 and 1949.

Table 1. The top 10 keywords from 1910 to 1913 and their similarity scores.

Keyword	Keyword (English) *	Similarity
南京临时政府	The Nanjing Provisional Government	0.6284
广州起义	Guangzhou Uprising	0.6083
武昌起义	Wuchang Uprising	0.582
中华革命	Chinese Revolution	0.5706
中国人民解放军	Chinese People’s Liberation Army (CPLA)	0.5681
广东省政府	Guangdong Provincial Government	0.5601
中华民国政府	Government of the Republic of China	0.5554
任粤军	Appointment to the Cantonese Army	0.5318
政权	political power	0.5214
革命政权	Revolutionary regime	0.5152

* This column is the English translation of the Chinese keyword.

Table 2. The top 10 keywords from 1924 to 1927 and their similarity scores.

Keyword	Keyword (English) *	Similarity
南昌起义	Nanchang Uprising	0.7696
农民起义	Peasant revolt	0.6198
武装起义	Armed uprising	0.6168
起义	uprising	0.6036
游击区	guerrilla zone	0.5939
叛乱	rebellion	0.5904
游击队	guerrilla force	0.5891
起义军	insurrectionary army	0.5873
国民革命军	National Revolutionary Army	0.5801
革命军	Revolutionary Army	0.5669

* This column is the English translation of the Chinese keyword.

Table 3. The top 10 keywords from 1937 to 1945 and their similarity scores.

Keyword	Keyword (English) *	Similarity
中国人民解放军	CPLA	0.6201
中国人民解放军空军	Air Force of the CPLA	0.6082
北京军区	Beijing Military Region	0.5897
福州军区	Fuzhou Military Region	0.5788
华东野战军	Eastern China Field Army	0.5754
朝鲜人民军	the Korean People’s Army	0.5719
驻华大使	ambassador	0.5686
中国空军	Chinese air force	0.553
台湾人	Taiwanese	0.5501
广西军区	Guangxi Military Region	0.5488

* This column is the English translation of the Chinese keyword.

Table 4. The top 10 keywords from 1945 to 1949 and their similarity scores.

Keyword	Keyword (English) *	Similarity
福州军区	Fuzhou Military Region	0.5691
中国民主同盟中央委员会	China Democratic League Central Committee	0.5578
中央警官	Central Police	0.5048
中国人民解放战争	Chinese People’s War of Liberation	0.5584
中国人民解放军	CPLA	0.5686
北京大学法学院	Peking University Law School	0.4877
中国人民政治协商会议	Chinese People’s Political Consultative Conference	0.5574
广州市委	Guangzhou Municipal Committee	0.5095
华北军区	Northern China Military Region	0.5347
广西军区	Guangxi Military Region	0.5587

* This column is the English translation of the Chinese keyword.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
