1. Introduction
With continued economic and social development, China’s demand for water resources keeps growing, and the imbalance between supply and demand is becoming increasingly prominent [1,2,3]. To improve the efficiency of water resource utilization, the water resources department has put forward refined management measures for enterprise water use, and the aim of water industry classification is to improve the efficiency and accuracy of water resource management [4]. By classifying water consumption, analyzing industrial water consumption data, collating water consumption information from the water supply network, and supporting the closed-loop analysis of water consumption data, the water demand and usage of various industries can be better understood, and more scientific water resource management policies and measures can be formulated [5,6]. However, corporate water use data collected from grassroots units are often of low quality, with industry information frequently missing, making it difficult to accurately obtain water use data across industries. By analyzing the original data, we find that enterprise names are highly correlated with industry, so the industry type of an enterprise can be determined from the semantic features of its name. However, the sparsity of features in short texts such as enterprise names and the significant heterogeneity of water usage among enterprises make it very challenging to extract effective semantic features from the names of water-using enterprises. For example, differences in enterprise structure and organizational form lead to inconsistent naming rules, which interfere with the extraction of core business features by models. Differences in regional water use patterns give the same words different meanings within an industry, which distorts contextual representations. This complexity makes it difficult for existing classification methods based on manual rules or simple statistics to adapt to dynamic changes, so classifying industries from these texts remains an open challenge.
In recent years, the rapid development of deep learning has brought new opportunities for solving text classification problems [7,8]. On the one hand, deep learning methods can automatically mine essential features in text and capture deep semantic representations from text data, avoiding the manual design of rules and features; on the other hand, deep learning models can map information from different modalities into a vector space, alleviating the shortage of information faced when classifying from a single data source. For example, in 2013, Google proposed Word2Vec [9], which uses the context of words to transform text data into structured and computable dense vectors, greatly improving the performance of text classification. In 2018, researchers at the Allen Institute for AI and the University of Washington proposed ELMo [10], which uses a multilayer bidirectional long short-term memory network (LSTM) to model the syntactic and semantic features of text. With the rise of transformers in recent years [11], a series of pre-trained language models such as BERT [12] can now be used to learn a general representation of text from a massive corpus, significantly improving the accuracy of downstream classification tasks. However, short texts such as enterprise names contain few words, vary widely, and have sparse features, which causes problems in text segmentation, text representation, feature extraction, and classification model building.
Traditional models based on convolutional neural networks (CNNs), LSTM, and transformers mainly extract contextual information from sequence text for classification [13,14,15]. However, text contains not only sequence features but also rich graph structure information, such as syntax and co-word relations [16]. Many researchers use graph convolutional networks (GCNs), which update node representations by iteratively aggregating information from neighboring nodes, to capture this graph structure information for text classification, achieving superior performance. For example, graph-CNN [17] first converts text into word graphs, uses a GCN to represent the word graphs, and finally obtains document labels through a classifier. TextGCN [18] treats both words and documents as nodes and constructs a heterogeneous word–document graph to represent text data and classify documents.
Enterprise names are typically composed of four elements: administrative division, brand name, industry or business characteristics, and organizational form. As a result, enterprises in related industries may share identical business characteristic words, which can be used to construct a graph, thereby enabling the application of a GCN for classification. However, because administrative divisions and organizational forms also recur across names, constructing the graph directly from all shared words easily produces a dense graph, which reduces the accuracy of industry identification. Therefore, it is difficult to directly apply existing text classification methods to the classification of water use enterprises. Based on this, we propose a bidirectional encoder representations from transformers and graph convolutional network (BERT-GCN) model based on a strong link relation graph for the classification of water use enterprises to improve the accuracy of industry identification. First, to construct the co-word relation between enterprises, we designed a co-word relation graph based on the industry keywords extracted by term frequency–inverse document frequency (TF-IDF), and extracted the co-word relation features using the graph convolutional network (GCN). Then, considering the low word count and limited semantic information of enterprise names, we used a web crawler to collect the main business data of the enterprises from Aiqicha (an enterprise credit query tool launched by Baidu) as supplementary information, and extracted semantic features using the pre-trained language model BERT. Finally, we concatenated the co-word relation features and semantic features and obtained the classification results of minor categories and major categories via the fully connected layers. The main contributions are as follows:
(1) The fusion and utilization of multi-source information improve classification accuracy. We construct a co-word relation graph to extract the co-word relation features. At the same time, a web crawler is used to extract the main business information of enterprises as supplementary data, and BERT is used to extract semantic features from that supplementary data. This method makes comprehensive use of multi-source information and can fully integrate the characteristics of many aspects to categorize enterprise industries.
(2) A strong link relation graph is constructed. Based on statistics of the naming preferences of enterprises in different industry categories, TF-IDF is used to select words that can represent the characteristics of industry categories as co-words. Building the graph only from these words removes co-words that contribute little to the final classification task, thus improving the precision of the industry classification of water use enterprises.
(3) The short text classification method is aimed specifically at the problem of the industry categorization of water enterprises. To verify the reliability and effectiveness of the proposed method, we utilize two real data sets from Xiuzhou District of Jiaxing City and Zhuji City. Compared with TextCNN, BERT-FC, TextGCN, and Word2Vec-GCN, the classification method we propose has the best performance in terms of precision, recall, and F1-score. It has important practical significance for regional water management.
The following sections of this paper are organized as follows:
Section 2 describes the current research in the field.
Section 3 introduces the method.
Section 4 compares the performance between the BERT-GCN model and other classification models.
Section 5 provides discussion and conclusions.
3. Methods
3.1. Problem Definition
Water resources are crucial for social and economic development, energy, ecosystem health, and human survival. Therefore, it is important to have real-time, dynamic, and rapid access to water usage information for different industries and sectors. At present, the quality of water use data collected from grassroots units is low and cannot meet the water resource management needs of different industries and sectors. Therefore, it is necessary to conduct research on the existing basic information within the water supply network, establish and improve an industry identification model for water use information, and classify the basic information of water users into industry categories. Thus, we propose the BERT-GCN model for the classification of water use enterprises to improve the accuracy of industry identification.
The classification of water use enterprises is mainly carried out using short text classification based on the semantic characteristics of the enterprise name. However, commonly used classification methods struggle to capture effective features for industry categorization. Therefore, we enhance the semantic information of enterprise names in two ways. First, by compiling statistics on the naming preferences of enterprises in different industry categories, the TF-IDF algorithm is used to screen out words that can represent the characteristics of industry categories as co-words. A co-word relation graph of enterprise names is constructed, and the co-word relation features are extracted using GCN. Second, a web crawler is used to obtain the main business description text as supplementary information, and BERT is used to extract its semantic features. Given an enterprise $e$, this study transforms the task of categorizing water use enterprises into the construction of a mapping relation $f$ between enterprise $e$, strong link relation graph $G$, and industry type, as shown in Equation (1):
$$y = f(e, t, G) \tag{1}$$
where $e$ is the enterprise to be classified; $t$ is the text describing the main business of the enterprise; $G$ is a strong link relation graph; and $y$ is the industry category of the water use enterprises, including major and minor categories.
3.2. Model Structure Overview
Because water use information from water supply networks cannot currently meet the needs of industrial water management, a BERT-GCN model based on a strong link relation graph is proposed for the industrial categorization of water use enterprises. The technical process, as shown in Figure 1, is divided into four stages:
(1) Data preprocessing. Data preprocessing involved two steps. First, based on the collected water usage data from the enterprises, we segmented the enterprise names and removed stop words. Then, we collected the main business scope data of the enterprise from Aiqicha (an enterprise credit query tool launched by Baidu) as supplementary information using web crawlers.
(2) Co-word relation feature extraction. The TF-IDF algorithm was used to extract high-frequency words that represent the characteristics of the industry. These high-frequency words were used as links to construct a strong link relation graph, which was then input into a GCN. The GCN further aggregated the word vector information to extract the co-word relation feature of the enterprise.
(3) Semantic feature extraction. The pre-trained BERT model was used to extract contextual information from the main business data of the enterprise as supplementary semantic features.
(4) Enterprise classification. First, we concatenated the co-word relation features and semantic features and obtained the classification results for the major and minor industrial categories through the fully connected layer. Then we compared the obtained enterprise industry classification results with the original industry classification results and calculated the accuracy of the BERT-GCN model.
Figure 1. BERT-GCN model technical flow chart.
3.3. Co-Word Relation Feature Extraction Based on GCN
First, we used TF-IDF to identify the words with high occurrence frequency as industry characteristic keywords. Then, we constructed a co-word relation graph where enterprise names are nodes and keywords are edges. Finally, we applied GCN to aggregate neighbor information and extracted features from the co-word relation graph.
(1) Keyword extraction
TF-IDF is a statistical text feature selection method used to evaluate the importance of words [33]. TF stands for term frequency, the frequency with which a word appears in a document, while IDF is the inverse document frequency, which measures how rare a word is across the documents of the corpus. If a word appears frequently in one document and rarely in the others, it is considered to have good discriminative ability and can be used as a feature to represent that document.
Enterprise names generally include industry or business characteristics, so enterprises in related industries may have the same business characteristic words; for example, the textile industry generally has names that contain “textile”, “knitting”, and other keywords, which can be used to create a co-word graph. Through the statistical analysis of enterprise names, it was found that they also contain information such as the location and nature of the enterprise (such as limited liability companies or limited companies). Traditional statistical methods can only count high-frequency words representing the industry based on the frequency of word occurrence, making it difficult to mine high-frequency words that can distinguish them from other industries. Therefore, this section introduces the TF-IDF algorithm to mine high-frequency words that can represent industry characteristics. The calculation steps for high-frequency words based on industry characteristics using the TF-IDF algorithm are as follows:
① For an industry category $C$, the TF value of word $w$ is calculated as shown in Equation (2):
$$\mathrm{TF}_{w,C} = \frac{n_{w,C}}{N_C} \tag{2}$$
where $n_{w,C}$ represents the number of times word $w$ appears in industry category $C$, and $N_C$ is the total number of words obtained by segmenting the names of companies in that industry category.
② The corresponding IDF value is obtained by dividing the total number of industry categories by the number of industry categories containing word $w$, and then taking the logarithm, as shown in Equation (3):
$$\mathrm{IDF}_w = \log \frac{M}{m_w} \tag{3}$$
where $M$ is the total number of industry categories and $m_w$ is the number of industry categories that contain the word $w$ (if no industry category contains the word $w$, the denominator is set to 1).
③ The TF-IDF value of word $w$ is calculated as shown in Equation (4):
$$\text{TF-IDF}_{w,C} = \mathrm{TF}_{w,C} \times \mathrm{IDF}_w \tag{4}$$
After calculating the TF-IDF values of each word, keywords that represent industry characteristics can be selected based on a threshold.
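The three steps above can be illustrated with a minimal Python sketch. It assumes the enterprise names have already been segmented and grouped by industry category; the function names, the toy data, and the threshold value of 0.2 are hypothetical and are not taken from the study.

```python
import math
from collections import Counter

def category_tfidf(tokens_by_category):
    """Return {category: {word: TF-IDF}} in the spirit of Equations (2)-(4).

    tokens_by_category maps an industry category to the list of all words
    obtained by segmenting the enterprise names in that category.
    """
    n_categories = len(tokens_by_category)
    # Number of categories whose names contain each word.
    category_count = Counter()
    for tokens in tokens_by_category.values():
        category_count.update(set(tokens))

    scores = {}
    for category, tokens in tokens_by_category.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[category] = {
            # TF = count / total words; IDF = log(categories / categories with word),
            # with the denominator floored at 1 as described in the text.
            word: (n / total) * math.log(n_categories / max(category_count[word], 1))
            for word, n in counts.items()
        }
    return scores

def select_keywords(scores, threshold):
    """Keep the words whose TF-IDF value reaches the threshold."""
    return {c: sorted(w for w, s in ws.items() if s >= threshold)
            for c, ws in scores.items()}

# Hypothetical toy data (already segmented, written in English for readability).
toy = {
    "textile": ["jiaxing", "textile", "company", "knitting", "textile", "company"],
    "food": ["zhuji", "food", "company", "bakery", "food", "company"],
}
scores = category_tfidf(toy)
keywords = select_keywords(scores, threshold=0.2)
```

In this toy example, "company" occurs in every category and therefore receives a TF-IDF value of zero, while "textile" and "food" pass the threshold. A place name such as "jiaxing" is only suppressed once it appears in several categories, which is what would be expected on real data.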
(2) Co-word feature extraction
When a keyword appears in two enterprise names, the two enterprises are assumed to have a co-word relation. Based on this, we constructed a co-word relation graph and used a GCN to extract co-word features. A GCN is a deep learning model designed for graph data, which consist of nodes, edges, and node features. It extracts features from the graph and generates corresponding representation vectors, and is mainly used for data with general topological graph structures. Starting from each node, the GCN performs convolution operations on the feature information of adjacent nodes and propagates node information to the surrounding area, so that each node can use not only its own features but also those of its neighbors. We input the strong link relation graph and feature vectors into a GCN, which further aggregated the word vector information to obtain the co-word features of the enterprise.
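As an illustration of the co-word relation, the following Python sketch links two enterprises whenever their segmented names share at least one selected keyword. The function name and the example names are hypothetical; the second call only shows how the graph becomes dense when place names and organizational forms are also treated as co-words.

```python
from itertools import combinations

def build_coword_edges(name_tokens, keywords):
    """Connect two enterprises when their names share at least one keyword."""
    # Keep only the selected keywords from each segmented name.
    kept = [set(tokens) & keywords for tokens in name_tokens]
    edges = set()
    for i, j in combinations(range(len(kept)), 2):
        if kept[i] & kept[j]:
            edges.add((i, j))
    return edges

# Hypothetical segmented enterprise names.
names = [
    ["jiaxing", "hongda", "textile", "company"],
    ["zhuji", "xinyi", "textile", "company"],
    ["jiaxing", "meiwei", "food", "company"],
]
# Only industry keywords act as links: a sparse, "strong link" style graph.
strong_edges = build_coword_edges(names, {"textile", "food"})
# Every shared word acts as a link: all three enterprises become connected.
dense_edges = build_coword_edges(
    names, {"jiaxing", "zhuji", "textile", "food", "company"}
)
```

With industry keywords only, the two textile enterprises are linked and the food enterprise stays isolated; with all shared words, every pair is linked through "company" or "jiaxing".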
We constructed a graph $G = (V, E)$, where $V$ and $E$ are the sets of enterprise nodes and co-word relation edges, respectively. The single-layer convolution of the GCN is shown in Equation (5):
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \tag{5}$$
where $H^{(l)}$ is the node feature matrix of the $l$th layer; $W^{(l)}$ is the weight matrix of the $l$th layer; $\tilde{A} = A + I$ is the adjacency matrix with self-loops added, $A$ is the original adjacency matrix; $I$ is the identity matrix; $\tilde{D}$ is the degree matrix of $\tilde{A}$, where each element on the diagonal is the degree of a node; and $\sigma$ is the ReLU activation function.
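The propagation rule of Equation (5) can be sketched in a few lines of Python. The sketch below uses plain lists so that it runs without third-party packages; a practical implementation would more likely rely on a tensor library. The matrices are hypothetical toy values, not parameters from the study.

```python
import math

def normalize_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a square 0/1 adjacency matrix."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    a_hat = normalize_adjacency(adj)
    z = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in z]

# Toy graph: enterprises 0 and 1 share a keyword, enterprise 2 is isolated.
adj = [[0, 1, 0],
       [1, 0, 0],
       [0, 0, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # H^(l), one row per enterprise
weights = [[1.0, -1.0], [0.5, 0.2]]               # W^(l)
hidden = gcn_layer(adj, features, weights)         # H^(l+1)
```

After one layer, the two linked enterprises receive identical representations because each averages its own features with its neighbor's, while the isolated enterprise is transformed using only its own features through the self-loop.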
3.4. Semantic Feature Extraction Based on BERT
In this section, we employed BERT to convert enterprises’ main business information into feature vectors, thereby capturing contextual semantic information. BERT, a pre-trained language model developed by Google, has significantly advanced NLP by enabling bidirectional context understanding. Its transformer architecture removes the step-by-step sequential dependency of traditional RNNs, allowing the input sequence to be processed in parallel. This enables the extraction of word relationship features within sentences and across multiple levels, resulting in a more comprehensive reflection of sentence semantics. Unlike previous pre-trained models, BERT captures semantic information based on sentence context, reducing ambiguity. Additionally, its bidirectional semantic information extraction capability yields richer and more nuanced features. The BERT model’s input is illustrated in Figure 2.
BERT first represents each character of the input text as a semantic vector, feeds these vectors into multiple transformer encoder layers for training, and finally obtains the trained word vectors. The most important structure in BERT is the transformer encoder, which includes key operations such as multi-head self-attention, residual connections, layer normalization, and linear transformations. Through these operations, the transformer encoder transforms the semantic vector of each word in the input text into an enhanced semantic vector of the same length. By stacking multiple transformer encoder layers, BERT learns the semantic vector of each word in the text.
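For reference, the self-attention operation and the residual step described above can be written in their standard form from the transformer literature. The symbols $Q$, $K$, $V$ (query, key, and value matrices), $d_k$ (key dimension), and $X$ (the matrix of input vectors) are not defined in this paper and are introduced here only for illustration.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
X' = \mathrm{LayerNorm}\big(X + \mathrm{MultiHead}(X)\big)
```

Because the attention output is added back to $X$ through the residual connection, $X'$ has the same shape as $X$, which is why each input vector is mapped to an enhanced vector of the same length.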
3.5. Enterprise Classification Based on Multi-Level Constraints
To obtain the industry classification results for the water use enterprises, we first concatenated the semantic features extracted by the BERT model and the co-word features extracted by the GCN model to obtain an enhanced feature vector $h$. Then, a fully connected layer with multi-level constraints was used to fit the feature vectors and map them to the probabilities of water use enterprises belonging to different industry categories. Because industry classification in the national economy is hierarchical, we used two fully connected layers to predict the major and minor categories. The calculation principle is shown in Equations (6) and (7):
$$\hat{y}_{major} = \mathrm{softmax}\left(W_{major} h + b_{major}\right) \tag{6}$$
$$\hat{y}_{minor} = \mathrm{softmax}\left(W_{minor} h + b_{minor}\right) \tag{7}$$
where $\hat{y}_{major}$ and $\hat{y}_{minor}$ are the major and minor industry categories of the enterprise predicted by the model; $W_{major}$ and $W_{minor}$ are the weight coefficients of the fully connected layers of the major and minor categories; and $b_{major}$ and $b_{minor}$ are the bias terms of the fully connected layers of the major and minor categories, respectively.
To balance the prediction results for both major and minor industry categories, the loss function of the model is defined as the sum of the prediction losses for major and minor industry categories, as shown in Equation (8):
$$L = L_{major} + L_{minor} \tag{8}$$
where $L_{major}$ and $L_{minor}$ are the cross-entropy loss functions for the predicted major and minor classes, respectively.
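A minimal Python sketch of the two prediction heads and the summed loss is given below. It assumes that each head is a single linear layer followed by a softmax, which is one common way to map features to class probabilities and is an assumption rather than a confirmed detail of the model. All feature values, weights, and labels are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, x):
    """Fully connected layer: W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def joint_loss(h, major_head, minor_head, major_label, minor_label):
    """Sum of the cross-entropy losses of the major and minor heads."""
    p_major = softmax(linear(*major_head, h))
    p_minor = softmax(linear(*minor_head, h))
    loss = -math.log(p_major[major_label]) - math.log(p_minor[minor_label])
    return p_major, p_minor, loss

semantic = [0.2, -0.1]   # stand-in for the BERT semantic features
co_word = [0.75, 0.5]    # stand-in for the GCN co-word features
h = semantic + co_word   # list concatenation gives the enhanced feature vector

# Each head is (weight rows, biases): 2 major classes, 3 minor classes.
major_head = ([[0.5, 0.0, 1.0, 0.0], [0.0, 0.5, 0.0, 1.0]], [0.0, 0.0])
minor_head = ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]],
              [0.0, 0.0, 0.0])
p_major, p_minor, loss = joint_loss(h, major_head, minor_head,
                                    major_label=0, minor_label=2)
```

Because both heads read the same concatenated vector and their losses are added, a gradient step on `loss` would update the shared features for both the major and the minor category at once.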
5. Conclusions
We propose a BERT-GCN model based on a strong link relation graph for industry classification of water use enterprises, which improves the accuracy of enterprise classification. We create a strong link relation graph based on the industry keywords extracted by TF-IDF, and extract co-word relation features by GCN. Then, we extract semantic features from the main business data collected by web crawlers. Finally, the semantic features and co-word relation features are connected to enhance the feature vector, and the classification results of the enterprise industry are obtained through a fully connected layer with multi-level constraints. The method was validated using two datasets from Jiaxing City and Zhuji City in Zhejiang Province. The experimental results showed that, compared with text classification methods such as TextCNN, BERT-FC, TextGCN, and Word2Vec-GCN, the BERT-GCN, based on a strong link relation graph, can obtain more complete semantic features from low-quality data for enterprise classification. Through classification models, we can predict the missing industry information for water-using enterprises. This enables us to statistically analyze water demand and usage per industry, assess each industry’s water use efficiency, and formulate more scientific water resource management policies and measures.
However, this study still has several limitations concerning data sources and generalization, regional and industry applicability, and label uncertainty. In future research, we therefore plan to integrate more diverse enterprise-related information to construct multidimensional relationship graphs, collaborate with external data platforms, and standardize data. In addition, because deep learning models are black boxes, users still cannot understand the basis of the model’s decisions, even though our method extracts rich features for the industry classification of water use enterprises. Jointly applying federated learning (FL) and explainable artificial intelligence (XAI) is a promising way to address data privacy and model interpretability together. We therefore also plan to build a federated learning framework to protect data privacy, while introducing interpretability techniques to analyze the decision criteria behind the industry classification.