Article

Person Entity Alignment Method Based on Multimodal Information Aggregation

National Digital Switching System Engineering & Technological R&D Center, PLA Information Engineering University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(19), 3163; https://doi.org/10.3390/electronics11193163
Submission received: 2 September 2022 / Revised: 27 September 2022 / Accepted: 28 September 2022 / Published: 1 October 2022

Abstract

Entity alignment determines whether entities from different sources refer to the same object in the real world. It is one of the key technologies for constructing large-scale knowledge graphs and is widely used in the fields of knowledge graphs and knowledge complementation. Because the visual-modality face attribute of a person entity lacks a semantic connection with the text-modality attribute and relationship information, it is difficult to model the visual and text modalities in the same semantic space, and, as a result, traditional multimodal entity alignment methods cannot be applied. In view of the scarcity of multimodal person relation graph datasets and the difficulty of multimodal semantic modeling of person entities, this paper analyzes and crawls open-source semi-structured data from different sources to build a multimodal person entity alignment dataset and focuses on using the facial and semantic information of multimodal person entities to strengthen the structural features of entities, which are modeled using a graph convolution layer and a dynamic graph attention layer, before calculating the similarity. Through verification on the self-built multimodal person entity alignment dataset, the method proposed in this paper is compared with entity alignment models of similar structure. Compared with AliNet, the probability that the first item in the candidate pre-aligned entity set is correct increases by 12.4%, and the average ranking of correctly aligned entities in the candidate pre-aligned entity set decreases by 32.8, which demonstrates the positive effect of integrating multimodal facial information and applying dynamic graph attention and a layer-wise gated network on the alignment of person entities.

1. Introduction

A person relation graph is a kind of vertical-domain knowledge graph with person entities at its core. It stores person attributes and person relations in the form of a semantic knowledge network, so as to clearly and intuitively display person-entity-related information and support downstream tasks. An entity refers to an object existing in the objective world, and a person entity is an entity that refers to a person in reality. Person attributes describe the characteristics of a person entity. Person relations refer to the specific relationships between person entities, including kinship and social relationships. Each piece of knowledge in the person relation graph is represented as a triplet. According to the type of knowledge expressed, triplets can be divided into attribute triplets of the form $\langle headPersonEntity, personAttribute, attributeValue \rangle$ and relation triplets of the form $\langle headPersonEntity, relation, tailPersonEntity \rangle$. Figure 1 shows how the person relation graph regards entities as nodes and relations as edges, thus representing person-related information with a graph structure. Because the person relation graph clearly and intuitively shows relationship and attribute information about people, it plays an important supporting role in personal relationship reasoning, person-related information retrieval and user portraits.
In order to build a high-quality and comprehensive person relation graph, it is often necessary to collect, gather and fuse person-related information from multi-source data. Therefore, person entity alignment is crucial to the construction of a person relation graph. The task of entity alignment is to use models or algorithms to determine whether entities from different sources refer to the same object in the real world. We formalize the two person relation graphs that need to be aligned as $G_1 = (E_1, R_1, A_1)$ and $G_2 = (E_2, R_2, A_2)$, where $E_i$, $R_i$ and $A_i$, $i \in \{1, 2\}$, represent the person entity set, relation set and attribute set of each person relation graph, respectively. Person entity alignment is used to find the alignment entity set $S = \{(e_1, e_2) \in E_1 \times E_2 \mid e_1 \equiv e_2\}$. The symbol $\equiv$ means that the entity $e_1$ from $E_1$ and the entity $e_2$ from $E_2$ have the same semantics, i.e., $e_1$ and $e_2$ refer to the same person in reality.
According to their representation learning mechanisms, entity alignment methods fall into two types: methods based on the transfer distance model and methods based on the graph convolution network. Methods based on the transfer distance model mainly obtain low-dimensional feature representations of entities through transfer distance models such as TransE [1]. Methods based on the graph convolution network first transform the knowledge into a graph structure and then model each entity as a low-dimensional vector using graph convolution or a graph attention network, from which entity pairs with high similarity are obtained [2]. Since models based on the graph convolution network can make full use of the information in pre-aligned seed entity pairs, their overall effect is marginally better than that of models based on transfer distance [3].
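As a concrete illustration of the transfer distance idea, the following minimal sketch scores a triplet under TransE, which models a relation as a translation from the head embedding to the tail embedding; the embeddings and their dimension are toy assumptions for demonstration, not values from this paper.

```python
import numpy as np

def transe_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """TransE plausibility: a triplet (h, r, t) is plausible when
    h + r is close to t, so a smaller L2 distance is better."""
    return float(np.linalg.norm(h + r - t, ord=2))

# Toy 4-dimensional embeddings (illustrative only).
rng = np.random.default_rng(0)
head, rel, tail = rng.normal(size=(3, 4))
print(transe_score(head, rel, tail))        # some arbitrary triplet
print(transe_score(head, rel, head + rel))  # perfectly plausible: 0.0
```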
To sum up, the current mainstream entity alignment approach is to calculate similarity by modeling each entity as a low-dimensional embedding. However, as far as the person relation graph is concerned, the sparse distribution of person relation triplets and the limited value range of person attribute triplets make it difficult to represent a person entity effectively as a low-dimensional embedding. Specifically, since the attributes that appear at scale in a person relation graph, i.e., nationality, zodiac sign and the like, take only a limited set of values, the attributes of a large number of person entities are likely to be highly similar. In extreme cases, all attributes of multiple person entities may even take the same values. If the above modeling methods are applied directly, the accuracy of person entity alignment suffers. In summary, the mainstream entity alignment mechanism based on low-dimensional modeling is difficult to apply directly, because in person relation graphs the attribute data of people are similar and the relation information is sparse.
With the emergence of a large amount of multimodal data such as images, videos and voice data, multimodal knowledge graphs have become an important direction in the development of knowledge graphs. For example, RichPedia [4] and other multimodal knowledge graph datasets have shown that they are far more abundant and complete than unimodal knowledge graphs and have significantly improved the universality of knowledge graphs and the effect of various downstream tasks. Therefore, work based on multimodal knowledge graphs has become a research hotspot in recent years. The task of multimodal entity alignment is divided into two subtasks: the alignment of text entities with visual entities and multimodal attribute entity alignment. The alignment of text entities with visual entities is similar to the task of image–text matching, and a joint representation is obtained by using a multimodal pretraining model or multi-objective iterative training. Multimodal attribute entity alignment methods mainly fuse the visual information and text information in the entity attributes to transfer the visual and text features to a common vector space and learn a multimodal representation of the entity with which to calculate similarity [5]. However, the semantic span between the visual facial information and the text-modality attribute and relationship information is large, and it is difficult to model them in the same semantic space using a multimodal pretraining model. Therefore, most current person entity alignment methods only use attribute and relationship information and ignore image information. As a unique biometric attribute of person entities, human facial information has obvious advantages in distinguishing between different person entities and in the task of person entity alignment. However, due to the sparsity of facial attributes in the person relation graph and the complexity of open-scene face recognition, a large amount of effective triplet information would be lost if person entity alignment were performed using face alignment alone, resulting in a poor alignment effect. Therefore, this paper attempts to strengthen structural representations with facial and semantic information, so as to improve the accuracy of person entity alignment.
Currently, multimodal information is not fully utilized in the task of person entity alignment, and the lack of a semantic connection between the person entity image and the text modality makes multimodal modeling difficult in person entity alignment. To solve these problems, this research first used crawler technology to obtain multimodal person information from multiple source websites, so as to construct two multimodal person relation graphs. A multimodal person entity alignment dataset was then constructed through relation clustering and image processing. Secondly, inspired by the fact that the AliNet [6] model alleviates the non-isomorphism of the neighborhood structures of corresponding entities in an end-to-end manner, a person entity alignment method that aggregates the structural, facial and semantic features of the person entity is proposed. Finally, experiments were designed to verify the effectiveness of the proposed method.
The main contributions of this paper are as follows:
  • Two multimodal person relation graphs were constructed, and a multimodal person entity alignment dataset was further constructed. Based on face detection technology and open-source semi-structured encyclopedia data from the web, this study built two multimodal person relation graphs with different knowledge sources and, through analysis, optimization and cleaning, created a multimodal person entity alignment dataset containing 23,512 entities and 59,691 triplets. It is now open-source on GitHub.
  • A person entity alignment method based on multimodal information aggregation is proposed. In this method, the single-hop and multi-hop neighborhood features of the person entity are first extracted using the graph convolution layer and the dynamic graph attention layer. Secondly, the layer-wise gated network is applied to aggregate the single-hop and multi-hop characteristics of the nodes comprehensively and reasonably into the structural feature. Finally, the cascade convolution network is used to detect the presence of a human face in the entity's image-modality attribute, and the pretrained SE-LResNet101E-IR [7] and bert-base-chinese [8] models are used to extract the facial and semantic features of the person entity, which are then aggregated with the structural feature modeled from the entity relation triplets to enhance the low-dimensional vector representation of the target entity.
  • Based on the constructed multimodal person entity alignment dataset, experiments were designed to verify the effectiveness of integrating multimodal information on person entities and of applying a dynamic graph attention network and a layer-wise gated network in the task of person entity alignment.

2. Materials and Methods

Inspired by the fact that AliNet alleviates the non-isomorphism of the corresponding entities' neighborhood structures using the graph convolution network and the graph attention network, this paper proposes a person entity alignment method based on multimodal information aggregation, called PEAMA. The non-isomorphism of the corresponding entities' neighborhood structures refers to the fact that two corresponding entities may have different neighborhood structures in different graphs. For example, for the person entity $\langle BanKiMoon \rangle$, the node $\langle Korea \rangle$ in the triplet $\langle BanKiMoon, nationality, Korea \rangle$ is a single-hop neighbor, i.e., a node directly connected with the target entity node, but through the triplets $\langle BanKiMoon, husband, YooSoonTaek \rangle$ and $\langle YooSoonTaek, nationality, Korea \rangle$ it is a multi-hop neighbor. This makes it difficult for the model to represent the entity well.
Based on AliNet’s use the traditional graph attention network to make the model obtain the influence weight of each neighborhood node on the target node, for solving the non-isomorphism of the corresponding entity neighborhood structure, PEAMA extracts additional facial features from the image modal facial information of the person entity, so as to comprehensively represent the person entity in multiple dimensions and solve the problem of ignoring multi-information on person entities in existing person entity alignment tasks. In addition, PEAMA uses the dynamic graph attention network and the layer-wise gated network to optimize the original architecture of AliNet. Specifically, the model architecture of PEAMA is mainly divided into a graph convolution layer, a dynamic graph attention layer, a layer-wise gated layer and a feature extraction layer. The graph convolution layer is responsible for aggregating the single-hop node features of the entity. The dynamic graph attention layer uses the attention mechanism to obtain the attention coefficient of the multi-hop neighborhood and aggregate the multi-hop node features of the entity. The layer-wised gated layer is responsible for aggregating the single-hop and multi-hop information. Finally, the facial and semantic features of the person entity are obtained through the feature extraction layer and spliced with the entity structural feature, which used cosine similarity and greedy strategy to generate alignment candidate sorting. The overall architecture of the model is shown in Figure 2.

2.1. Graph Convolution Layer

The graph convolution layer of the model is responsible for recursively aggregating the input node feature and its single-hop neighborhood feature vectors to learn the single-hop feature representation of the node. The core idea of the traditional graph convolution network is to augment each node's features with the feature information of its neighbor nodes, and its formula is as follows [9]:
$H^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$
where $\hat{D}$ refers to the degree matrix, $\hat{A}$ refers to the adjacency matrix of the target nodes, $H^{(l)}$ refers to the implicit representation of the nodes in the $l$-th layer, $W^{(l)}$ refers to the weight matrix of the $l$-th layer and $\sigma$ refers to the activation function. Entity nodes are first modeled by TransE and then input into the convolutional layer for training and learning.
Based on this idea, in this model, the implicit feature representation of nodes in the graph convolution layer can be expressed as:
$h_i^{(l)} = \sigma\left(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{c_i} W^{(l)} h_j^{(l-1)}\right)$
where $\mathcal{N}(i)$ refers to the one-hop neighborhood node set of node $i$ and $c_i$ is a normalization constant. The activation function is not used in the graph convolution layer of this model.
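For illustration, here is a dense numpy sketch of this aggregation with self-loops, the normalization constant $c_i$ taken as the size of $\mathcal{N}(i) \cup \{i\}$, and, as in this model, no activation; the toy adjacency matrix and dimensions are assumptions.

```python
import numpy as np

def gcn_layer(adj: np.ndarray, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One graph convolution step: each node averages the linearly
    transformed features of its one-hop neighbors and itself."""
    a_hat = adj + np.eye(adj.shape[0])     # add self-loops
    c = a_hat.sum(axis=1, keepdims=True)   # normalization constant c_i
    return (a_hat / c) @ h @ w             # no activation, as in this model

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # toy graph
h = np.random.default_rng(1).normal(size=(3, 4))   # initial node features
w = np.random.default_rng(2).normal(size=(4, 4))   # layer weight matrix
print(gcn_layer(adj, h, w).shape)  # (3, 4)
```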

2.2. Dynamic Graph Attention Layer

The dynamic graph attention layer of the model is responsible for calculating the attention weights between the multi-hop neighborhood nodes of the target entity, so as to highlight the useful multi-hop neighborhood nodes and aggregate their features to better represent the characteristics of the target entity. The input of the traditional static graph attention network is the node feature matrix $h = \{h_1, h_2, \ldots, h_N\}$, $h_i \in \mathbb{R}^F$, where $N$ is the number of nodes and $F$ is the feature dimension, and the new node feature matrix $h' = \{h'_1, h'_2, \ldots, h'_N\}$, $h'_i \in \mathbb{R}^{F'}$ is output after feature extraction and attention coefficient calculation. The calculation formula for the attention mechanism between nodes $i$ and $j$ is as follows [10]:
$e_{ij} = \mathrm{LeakyReLU}\left(a^T [W h_i \,\|\, W h_j]\right)$

$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$
The output features of a node after passing through the static graph attention network are then obtained by the weighted summation over all neighbor nodes:
$h'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right)$
where $W$ is the shared weight matrix, $a^T$ is the weight vector of a single-layer feed-forward neural network, $\sigma$ is the activation function and $\|$ denotes concatenation.
However, when applied to the task of entity alignment, the traditional graph attention network has two problems: the shared weight matrix and its static nature. First, as shown in Formula (3), the traditional graph attention network uses only one shared weight matrix $W$ when calculating the attention coefficients between nodes, but the entity nodes in a knowledge graph usually differ from their adjacent nodes; for example, entity nodes and attribute nodes usually have large structural differences. Applying a shared weight matrix to different types of nodes therefore makes it difficult for the model to correctly distinguish target entity nodes from adjacent attribute nodes, reducing the representational ability of the model. To solve this problem, PEAMA uses two different matrices, $W_1$ and $W_2$, to perform linear transformations on entity nodes and adjacent nodes, respectively. After initialization, $W_1$ and $W_2$ perform linear transformations on the target node feature and the neighborhood node feature, respectively. The modified attention coefficient calculation formula is as follows:
$e_{ij} = \mathrm{LeakyReLU}\left(a^T [W_1 h_i \,\|\, W_2 h_j]\right)$

$\alpha_{ij} = \mathrm{softmax}_j\left(\mathrm{LeakyReLU}\left(a^T [W_1 h_i \,\|\, W_2 h_j]\right)\right)$
By using different linear transformation mechanisms for different nodes, the model enhances the discrimination ability of the graph attention network for nodes, so that the network can better extract features.
Secondly, given the finite set of relationships between adjacent nodes and entities and the monotonicity of $\mathrm{softmax}$ and $\mathrm{LeakyReLU}$, there exists a node $j$ that maximizes $a^T [W h_j]$ for every query node $i$, so the static graph attention network always tends to give node $j$ the largest attention coefficient while ignoring the different relationships between different input nodes. That is, given a group of nodes and a pretrained graph attention layer, the attention function $\alpha$ favors the same maximal node $j$ regardless of the query, so the static graph attention network is better suited to mapping all inputs to the same constant output. However, it is difficult to build a good model this way when, as in the task of entity alignment, different query inputs have different correlations with different nodes. As shown in Formula (7), the traditional graph attention network applies the linear transformations $W$ and $a^T$ consecutively and then applies the nonlinear transformations $\mathrm{softmax}$ and $\mathrm{LeakyReLU}$ consecutively, so each consecutive pair collapses into a single transformation. In order to solve this problem, PEAMA applies the dynamic graph attention network [11]. The main reason for the limited attention of the static graph attention network is that it simply uses the learnable matrices $a^T$ and $W$ consecutively, so that they degenerate into a single linear layer. The dynamic graph attention network therefore applies the nonlinear function $\mathrm{LeakyReLU}$ first and only then the feed-forward layer $a^T$ in the calculation of the attention mechanism. The expression is modified as follows:
$e_{ij} = a^T \mathrm{LeakyReLU}\left([W_1 h_i \,\|\, W_2 h_j]\right)$
For any query node in the static graph attention network, the attention function is monotonic: the ranking of the attention coefficients is shared among all nodes in the graph and is not conditioned on the query node, which seriously limits the feature extraction ability of the network. In the modified dynamic graph attention network, by contrast, each query induces its own ranking over the attention coefficients of the keys, so the network has stronger representational ability. Moreover, the dynamic graph attention network avoids the simple consecutive use of the learnable matrices $a^T$ and $W$, preventing the feed-forward layer and the weight matrix from degenerating into a single linear layer; it thus realizes a universal approximate attention function and improves the robustness of the model while obtaining stronger representational ability. Therefore, for the task of entity alignment, with its complex relationships between nodes and differing ranking requirements across neighborhoods, the dynamic graph attention network can significantly improve the learned weights between nodes and thus the entity alignment effect.
In conclusion, by using different weight matrices for the target node and the adjacent nodes and by applying the dynamic graph attention mechanism, the model enhances the feature extraction ability of the traditional graph attention mechanism, which makes it more suitable for the task of entity alignment and better able to aggregate the multi-hop features of the target node.
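To make the two modifications concrete, the sketch below contrasts the static scoring function $e_{ij} = \mathrm{LeakyReLU}(a^T [W h_i \| W h_j])$ with the dynamic scoring $e_{ij} = a^T \mathrm{LeakyReLU}([W_1 h_i \| W_2 h_j])$ described above; the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def static_score(a, w, h_i, h_j):
    """Static GAT: e_ij = LeakyReLU(a^T [W h_i || W h_j]); a and W act
    consecutively, so score rankings are shared across query nodes."""
    return float(leaky_relu(a @ np.concatenate([w @ h_i, w @ h_j])))

def dynamic_score(a, w1, w2, h_i, h_j):
    """Dynamic variant: e_ij = a^T LeakyReLU([W1 h_i || W2 h_j]); the
    nonlinearity sits between the two linear maps, so they cannot
    collapse into one linear layer and rankings depend on the query."""
    return float(a @ leaky_relu(np.concatenate([w1 @ h_i, w2 @ h_j])))

rng = np.random.default_rng(0)
h_i, h_j = rng.normal(size=(2, 4))
w, w1, w2 = rng.normal(size=(3, 3, 4))  # separate maps for target/neighbor
a = rng.normal(size=6)
print(static_score(a, w, h_i, h_j), dynamic_score(a, w1, w2, h_i, h_j))
```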

2.3. Layer-Wise Gated Layer

After obtaining the single-hop and multi-hop structural features of the person entity, PEAMA uses a gating network to aggregate the two reasonably and obtain a more comprehensive feature representation of the target entity. With the traditional gating network, the aggregation mechanism is as follows:
$h_i^{(l)} = g(h_{i,2}^{(l)}) \cdot h_{i,1}^{(l)} + \left(1 - g(h_{i,2}^{(l)})\right) \cdot h_{i,2}^{(l)}$
where $h_{i,1}^{(l)}$ and $h_{i,2}^{(l)}$ represent the outputs of the graph convolution layer and the dynamic graph attention layer, respectively, and the gating function can be expressed as:
$g(h_{i,2}^{(l)}) = \sigma\left(M h_{i,2}^{(l)}\right)$
where $M$ is the gating network weight matrix and $\sigma$ is the activation function. At this point, however, the gate considers only the multi-hop node features output by the dynamic graph attention layer. In order to aggregate the single-hop and multi-hop characteristics of nodes more comprehensively and reasonably, to enhance the fitting ability of the gating network and to improve the sensitivity of the network to small changes in the multi-hop nodes, the model improves the gating mechanism used in AliNet with a layer-wise gated network. The layer-wise gating function is as follows:
$g(h_{i,1}^{(l)}, h_{i,2}^{(l)}) = \tanh\left(M_1 h_{i,1}^{(l)} + M_2 h_{i,2}^{(l)} + b\right)$
The layer-wise gated network uses two different weight matrices on the outputs of the graph convolution layer and the dynamic graph attention layer, i.e., the single-hop and multi-hop neighborhood information of the node, which makes the model more comprehensive when aggregating node characteristics. The network adds a bias term to enhance the representational ability of the gate. At the same time, since the single-hop feature is now taken into account inside the gating function, the $\tanh$ function is used instead of the ReLU function when adding nonlinearity, improving the sensitivity of the network output to changes in node characteristics and expanding the value range so that the different weights of the single-hop and multi-hop features can be balanced more flexibly. The final node characteristics are as follows:
$h_i^{(l)} = \tanh\left(g(h_{i,1}^{(l)}, h_{i,2}^{(l)}) \cdot h_{i,1}^{(l)} + \left(1 - g(h_{i,1}^{(l)}, h_{i,2}^{(l)})\right) \cdot h_{i,2}^{(l)}\right)$
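Here is a short sketch of this layer-wise gate, where both the single-hop output $h_1$ (graph convolution) and the multi-hop output $h_2$ (dynamic graph attention) enter the gate together with a bias term; the shapes are illustrative assumptions.

```python
import numpy as np

def layerwise_gate(h1, h2, m1, m2, b):
    """Aggregate single-hop (h1) and multi-hop (h2) features with a gate
    that sees both inputs: g = tanh(M1 h1 + M2 h2 + b), then blend."""
    g = np.tanh(m1 @ h1 + m2 @ h2 + b)
    return np.tanh(g * h1 + (1.0 - g) * h2)

rng = np.random.default_rng(3)
h1, h2 = rng.normal(size=(2, 8))     # toy layer outputs
m1, m2 = rng.normal(size=(2, 8, 8))  # gate weight matrices
b = rng.normal(size=8)               # bias term
print(layerwise_gate(h1, h2, m1, m2, b).shape)  # (8,)
```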

2.4. Feature Extraction Layer

The feature extraction layer is responsible for extracting features from the visual and text attributes of the person entity. For the visual-modality facial information, the feature extraction layer uses the cascade convolution network and a pretrained face recognition model to extract facial features; for the text-modality attribute information, it directly uses the feature extraction interface of a natural language processing pretraining model to obtain the semantic features of the entity attributes.
When extracting the facial features of a person entity, the feature extraction layer first uses the Base64 library to decode the entity's binary description image in the dataset into JPG format. Secondly, it uses the three-stage cascade convolution network to detect whether there is facial information in the entity description image and crops it. Finally, it uses the pretraining model with SE-LResNet101E-IR as the network backbone, pretrained on MS-Celeb-1M [12], to extract facial features.
When the three-stage cascade convolutional network [13] is used for face detection, it first reshapes the input image and generates an image pyramid so that the model can adapt to input images of any size. Secondly, the image pyramid is input to ProposalNet to preliminarily generate candidate and bounding boxes, and the candidates are then input to RefineNet to filter out the poor ones. Finally, the remaining candidates are passed to the more complex, more strongly supervised OutputNet to obtain the final output. The core idea of the cascade convolution network is to use a lightweight model to generate a large number of low-confidence target candidates and then feed them into a more complex convolution model for high-precision regression and output, ensuring high accuracy while greatly improving detection speed.
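Schematically, the cascade can be expressed as below. The three networks are placeholders standing in for the proposal, refinement and output stages of [13], and the pyramid scales and confidence threshold are illustrative assumptions, not values used in this paper.

```python
def cascade_detect(image, proposal_net, refine_net, output_net,
                   scales=(1.0, 0.7, 0.5), conf=0.9):
    """Three-stage cascade: a cheap network proposes many boxes over an
    image pyramid, a mid-sized network filters them, and a heavier
    network produces the final boxes, landmarks and confidences."""
    candidates = []
    for s in scales:  # image pyramid: run the cheap net at each scale
        candidates += proposal_net(image, s)
    refined = [box for box in candidates if refine_net(image, box) >= conf]
    return [output_net(image, box) for box in refined]
```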
The cascade convolution network uses three tasks, face classification, bounding box regression and facial landmark localization, to train the convolutional neural network detectors. For the face classification task, the cross-entropy loss is used as the loss function, while for bounding box regression and facial landmark localization, the Euclidean loss is used:
$L_i^{det} = -\left(y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i)\right)$

$L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2$

$L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2$
where $p_i$ is the probability that the candidate sample is truly a human face, $y_i^{det}$ is the ground-truth label, $\hat{y}_i^{box}$ represents the regression coordinates obtained from the network, $y_i^{box}$ represents the ground-truth box coordinates, $\hat{y}_i^{landmark}$ represents the facial landmark coordinates obtained from the network and $y_i^{landmark}$ represents the corresponding ground-truth coordinates.
Since different kinds of training images are used when training the multi-task model, the above loss functions are not all active at the same time. The overall learning objective can therefore be expressed as:
$\min \sum_{i=1}^{N} \sum_{j \in \{det,\, box,\, landmark\}} \alpha_j \, \beta_i^j \, L_i^j$
where $N$ is the number of training samples, $\alpha_j$ is a parameter indicating the importance of each task and $\beta_i^j \in \{0, 1\}$ is a sample-type indicator marking which losses apply to a given input sample.
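Under the reconstruction above, the per-sample losses and the weighted multi-task objective can be written as follows; the task weights are the values reported for the early stages in [13] and are used here only as an assumption for illustration.

```python
import numpy as np

def det_loss(y: int, p: float, eps: float = 1e-12) -> float:
    """Face classification cross-entropy."""
    return float(-(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def euclidean_loss(y_hat, y) -> float:
    """Squared L2 loss for box and landmark regression."""
    return float(np.sum((np.asarray(y_hat) - np.asarray(y)) ** 2))

ALPHA = {"det": 1.0, "box": 0.5, "landmark": 0.5}  # assumed task weights

def multitask_loss(samples, alpha=ALPHA):
    """Sum alpha_j * beta_i^j * L_i^j over samples and tasks, where
    beta masks out tasks a sample has no labels for."""
    total = 0.0
    for s in samples:
        for task in ("det", "box", "landmark"):
            if s["beta"].get(task, 0):  # beta in {0, 1}
                total += alpha[task] * s["loss"][task]
    return total

sample = {"beta": {"det": 1, "box": 1, "landmark": 0},
          "loss": {"det": det_loss(1, 0.9),
                   "box": euclidean_loss([0.1, 0.1, 0.8, 0.8], [0, 0, 1, 1]),
                   "landmark": 0.0}}
print(multitask_loss([sample]))
```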
During face detection, the feature extraction layer is additionally trained on our self-built face dataset to obtain an appropriate confidence threshold, $b$. If multiple faces are found when detecting faces in the person's description image attribute, or if the face confidence is less than $b$, the image is considered to contain no facial information that accurately describes the person entity.
The cascade convolution network outputs the coordinates and confidence of the detected facial information. The feature extraction layer crops the original image using these coordinates to obtain more accurate entity facial information and then extracts the facial features using InsightFace [14], with SE-LResNet101E-IR as the network backbone, pretrained on MS-Celeb-1M with the additive angular margin loss (ArcFace loss). Using the ArcFace loss shown in Formula (16) as the training target, the feature embeddings are mapped onto a hypersphere of radius $s$, and an additive angular margin penalty is added to expand the inter-class difference and compact the intra-class distance, which significantly reduces training time while maintaining detection accuracy.
$L_{ArcFace} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j=1, j \neq y_i}^{n} e^{s \cos \theta_j}}$
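Here is a numpy sketch of this loss, assuming embeddings and class weights are unit-normalized so the logits are cosines; the scale and margin values (s = 64, m = 0.5) are the common defaults from [14], used here only as assumptions.

```python
import numpy as np

def arcface_loss(cosines: np.ndarray, labels: np.ndarray,
                 s: float = 64.0, m: float = 0.5) -> float:
    """ArcFace: add an angular margin m to the target-class angle,
    rescale by s, then apply softmax cross-entropy."""
    n = len(labels)
    theta = np.arccos(np.clip(cosines, -1.0, 1.0))  # angles per class
    logits = s * cosines.copy()
    logits[np.arange(n), labels] = s * np.cos(theta[np.arange(n), labels] + m)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(n), labels].mean())

cosines = np.array([[0.8, 0.1], [0.2, 0.7]])  # toy cosine logits
print(arcface_loss(cosines, np.array([0, 1])))
```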
When extracting the semantic features of a person entity's text attributes, the feature extraction layer directly uses the feature extraction interface of bert-base-chinese. After acquiring the attribute semantic features, the feature extraction layer aggregates the facial, semantic and structural features through feature splicing to obtain the final person entity representation. During the training and validation of the actual model, considering that facial information is a sparse attribute of person entities, if the feature extraction layer does not obtain a face for an entity, whether due to image clarity or a missing description image attribute, it splices in random features of the same dimension as the facial features to improve the robustness and generalization ability of the model.
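The fallback described here, splicing random features of the facial dimension when no reliable face is found, can be sketched as follows; the embedding dimension of 512 is only an assumption, not a value stated in the paper.

```python
import numpy as np

FACE_DIM = 512  # assumed face embedding size

def entity_representation(structural, semantic, face=None, rng=None):
    """Concatenate structural, semantic and facial features; when the
    entity has no usable face, splice in random features of the same
    dimension so downstream layers always see a fixed-size input."""
    rng = rng or np.random.default_rng()
    if face is None:
        face = rng.normal(size=FACE_DIM)  # random fallback, per the paper
    return np.concatenate([structural, semantic, face])
```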

3. Experiment and Discussion

3.1. Datasets

In order to measure the effectiveness of the person entity alignment method based on multimodal information aggregation, this study analyzed open-source semi-structured encyclopedia data using face detection and crawler technology to build a multimodal person entity alignment dataset. First, Baidu Encyclopedia and Sogou Encyclopedia were crawled on a large scale to build knowledge bases for each. Secondly, the face detection mechanism was used to detect whether there was a face in the description image of each encyclopedia entry, yielding the person entries. Thirdly, by analyzing the attributes and relationships of the person entries and aggregating the relationships between person entities and their attributes into 19 relationship types, we built the multimodal person relation graphs. Finally, the multimodal person entity alignment dataset was constructed through data cleaning.
This experiment mainly used the multimodal person entity alignment dataset crawled, analyzed and constructed from Baidu Encyclopedia and Sogou Encyclopedia, comprising 23,512 entities and 59,691 triplets. Table 1 gives the statistics of the dataset.
To verify the generalization ability of the model, we also used the traditional unimodal entity alignment dataset DBP15K [15] to evaluate the model. The basic statistics of the DBP15K dataset are shown in Table 2.

3.2. Configuration

This study conducted its experiments on the TensorFlow 2.0 deep learning framework. The compilation environment was Python 3.7.11, and the operating system was Ubuntu 18.04. The experimental hardware was configured with an Intel Xeon Gold 6132 2.60 GHz CPU, 256 GB of memory and an NVIDIA GeForce 3090 24 GB GPU.

3.3. Evaluation Index

In this paper, we used Hits@n, mean reciprocal rank (MRR) and mean rank (MR) to objectively evaluate the entity alignment accuracy of the model. The larger the Hits@n and MRR and the smaller the MR, the better the performance of the model.
Hits@n represents the probability that the top n items of the candidate entity alignment ranking contain the correct result, MRR is the average of the reciprocals of the correct rankings in the candidate alignments, and MR is the average correct ranking in the candidate alignments. The calculation formulas are:
$\mathrm{Hits@}n = \frac{1}{|S|} \sum_{i \in S} \mathbb{I}(rank_i \le n)$

$\mathrm{MRR} = \frac{1}{|S|} \sum_{i \in S} \frac{1}{rank_i}$

$\mathrm{MR} = \frac{1}{|S|} \sum_{i \in S} rank_i$
where $S$ is the total triplet set, $rank_i$ is the predicted alignment ranking of the $i$-th triplet and $\mathbb{I}(\cdot)$ is the indicator function.
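For reference, all three metrics follow directly from the 1-indexed rank of the correct entity in each candidate list, as in the following sketch.

```python
import numpy as np

def alignment_metrics(ranks, n=10):
    """Hits@n, MRR and MR from 1-indexed ranks of the correct entities."""
    ranks = np.asarray(ranks, dtype=float)
    return {
        f"Hits@{n}": float((ranks <= n).mean()),
        "MRR": float((1.0 / ranks).mean()),
        "MR": float(ranks.mean()),
    }

print(alignment_metrics([1, 3, 12, 2], n=10))
# {'Hits@10': 0.75, 'MRR': 0.479..., 'MR': 4.5}
```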

3.4. Results and Analysis

In order to verify the effectiveness of aggregating multimodal features, applying the dynamic graph attention network and using the layer-wise gated network, this paper compares PEAMA with AliNet on the same datasets. The experimental results are shown in Table 3. The best results are marked in bold and the second-best results are underlined.
It can be seen from Table 3 that on the multimodal person entity alignment dataset built by crawling Baidu Encyclopedia and Sogou Encyclopedia, PEAMA improves Hits@1 by 12.4% and decreases MR by 32.796 compared with AliNet, which demonstrates the positive effect of applying the dynamic graph attention network and the layer-wise gated network and the effectiveness of aggregating multimodal information such as facial, semantic and structural features. To isolate the effects of multimodal aggregation, the dynamic graph attention mechanism and the layer-wise gated network on person entity alignment, this study also evaluated a variant using only facial similarity comparison and a variant applying the dynamic graph attention mechanism and the layer-wise gated network without aggregating multimodal information. A gap remains between these variants and the full model that splices multimodal information. When only facial information is used for entity alignment, the effect is poor because of the sparsity of the visual facial attribute and the neglect of attribute and relation triplets.
In order to further demonstrate the effectiveness of applying the dynamic graph attention mechanism and the layer-wise gated network, this study applied PEAMA without the feature extraction layer to the traditional unimodal entity alignment dataset DBP15K, with the results shown in Table 4. On this dataset, the metrics of PEAMA improve, to a certain extent, over the original model, which demonstrates the positive effect of the dynamic graph attention mechanism and the layer-wise gated network on entity alignment and the generality of PEAMA.

3.5. Complexity Analysis

The time complexity of the traditional graph attention network is $O(|\mathcal{V}| d d' + |\mathcal{E}| d')$, where $\mathcal{V}$ is the node set, $\mathcal{E}$ is the edge set, $d$ is the input dimension of the graph attention layer and $d'$ is its output dimension [11].
The attention coefficient of the dynamic graph attention network is calculated as $e_{ij} = a^T \mathrm{LeakyReLU}([W_1 h_i \,\|\, W_2 h_j])$. First, computing $W_1 h_i$ and $W_2 h_j$ for every node takes $O(|\mathcal{V}| d d')$; then computing the concatenation and the $\mathrm{LeakyReLU}$ activation over $W_1 h_i$ and $W_2 h_j$ takes $O(|\mathcal{E}| d')$; and finally applying the linear layer $a^T$ takes $O(|\mathcal{E}| d')$. The total time complexity is therefore the same as that of the traditional graph attention network, $O(|\mathcal{V}| d d' + |\mathcal{E}| d')$. Moreover, the model only needs to learn and optimize the feed-forward layer $a^T$ and the weight matrices $W_1$ and $W_2$ whether the dynamic or the static graph attention network is applied, so the two have the same space complexity [11].
In the actual code implementation of this model, in order to facilitate splicing the matrices $W_1 h_i$ and $W_2 h_j$, each matrix is tiled $|\mathcal{V}|$ times before splicing, where $|\mathcal{V}|$ is the total number of nodes in the dataset. As shown in Table 5, since $|\mathcal{V}|$ is significantly larger than the input and output dimensions $d$ and $d'$ and cannot be ignored, this expansion occupies part of the computation and increases the training time. In future work, the training time can be significantly reduced by optimizing the code implementation.

4. Conclusions

Considering that large-scale multimodal person relation graphs and person entity alignment datasets are currently scarce and that multimodal person entities cannot easily use multimodal pretraining models for joint representation, this study analyzed and crawled open-source semi-structured web data from different sources, thereby creating two multimodal person relation graphs containing facial image information. Inspired by the fact that AliNet alleviates the non-isomorphism of the corresponding entities' neighborhood structures in an end-to-end manner, a person entity alignment method based on multimodal information aggregation is proposed. This method first aggregates and models the single-hop neighborhood of the entity using the graph convolution network, then replaces the static graph attention network of the original model with the dynamic graph attention network to obtain the multi-hop node attention coefficients and model the multi-hop neighborhood. Thirdly, the layer-wise gated network replaces the original gating mechanism to aggregate the outputs of the graph convolution network and the dynamic graph attention network into the entity's structural feature. Finally, the cascade convolution network, the pretrained SE-LResNet101E-IR model and bert-base-chinese are used to extract features from the entity's visual and text attributes, and the facial and semantic features are aggregated with the structural features to generate the final entity representation, on which similarity calculation and entity alignment are then carried out.
Since the face attribute of a person is sparse in the person relation graph, when a person entity lacks facial information we tried filling the gap with either random features or all-zero features of the same dimension, and a contrast experiment showed that random features work better. Consequently, this method has certain requirements on the multimodal quality of the pre-aligned graphs: if the facial information in the pre-aligned person entity dataset is too sparse, accuracy may drop. However, the experimental results on the unimodal dataset DBP15K show that this does not affect the method's performance on traditional unimodal entity alignment tasks. In follow-up work, we hope to improve how the traditional graph convolution network learns the single-hop node features of entities, to model the relations in advance so that external knowledge can be introduced to improve the alignment effect, and to improve facial feature extraction so that it can handle facial information from different ages.

Author Contributions

Conceptualization, H.W. and R.H.; methodology, H.W.; software, H.W.; validation, H.W., R.H. and J.Z.; formal analysis, H.W.; investigation, H.W.; resources, H.W.; data curation, H.W.; writing—original draft preparation, H.W.; writing—review and editing, H.W. and R.H.; visualization, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The National Natural Science Foundation of China, grant number 62002384.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The multimodal person entity alignment dataset which we made has been open-sourced at https://github.com/Ccccitrus/Multi-modal-person-entity-alignment-dataset- (accessed on 2 September 2022). DBP15K can be downloaded at https://github.com/nju-websoft/JAPE (accessed on 2 December 2017).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. Adv. Neural Inf. Process. Syst. 2013, 26.
  2. Wang, Z.; Lv, Q.; Lan, X.; Zhang, Y. Cross-lingual knowledge graph alignment via graph convolutional networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 349–357.
  3. Zeng, K.; Li, C.; Hou, L.; Li, J.; Feng, L. A comprehensive survey of entity alignment for knowledge graphs. AI Open 2021, 2, 1–13.
  4. Wang, M.; Wang, H.; Qi, G.; Zheng, Q. Richpedia: A Large-Scale, Comprehensive Multi-Modal Knowledge Graph. Big Data Res. 2020, 22, 100159.
  5. Chen, L.; Li, Z.; Wang, Y.; Xu, T.; Wang, Z.; Chen, E. MMEA: Entity alignment for multi-modal knowledge graph. In Proceedings of the International Conference on Knowledge Science, Engineering and Management, Hangzhou, China, 28–30 August 2020; Springer: Cham, Switzerland, 2020; pp. 134–147.
  6. Sun, Z.; Wang, C.; Hu, W.; Chen, M.; Dai, J.; Zhang, W.; Qu, Y. Knowledge Graph Alignment Network with Gated Multi-Hop Neighborhood Aggregation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 222–228.
  7. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  8. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
  9. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3844–3852.
  10. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
  11. Brody, S.; Alon, U.; Yahav, E. How Attentive are Graph Attention Networks? In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021; pp. 1–10.
  12. Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 87–102.
  13. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
  14. Deng, J.; Guo, J.; Yang, J.; Xue, N.; Cotsia, I.; Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4690–4699.
  15. Sun, Z.; Hu, W.; Li, C. Cross-Lingual Entity Alignment via Joint Attribute-Preserving Embedding; Springer: Cham, Switzerland, 2017; pp. 3–10.
  16. Schlichtkrull, M.; Kipf, T.N.; Bloem, P.; van den Berg, R.; Titov, I.; Welling, M. Modeling Relational Data with Graph Convolutional Networks. In Proceedings of the Semantic Web, Monterey, CA, USA, 8–12 October 2018; Springer: Cham, Switzerland, 2018; pp. 1–13.
Figure 1. An example of a multimodal person relation graph.

Figure 2. The model architecture of PEAMA.
Table 1. Multimodal person entity alignment dataset statistics.

Data Source | Language | Entities | Relations | Triplets
Baidu Encyclopedia | Chinese | 14,226 | 19 | 38,716
Sogou Encyclopedia | Chinese | 9,286 | 19 | 20,975
Table 2. The statistics of DBP15K.

Dataset | Language | Entities | Relations | Triplets
DBP15K ZH-EN | Chinese | 66,469 | 2,830 | 153,929
DBP15K ZH-EN | English | 98,125 | 2,317 | 237,674
DBP15K JA-EN | Japanese | 65,744 | 2,043 | 164,373
DBP15K JA-EN | English | 95,680 | 2,096 | 233,319
DBP15K FR-EN | French | 66,858 | 1,379 | 192,191
DBP15K FR-EN | English | 105,889 | 2,209 | 278,590
Table 3. Comparison of the effects of PEAMA and AliNet on the Baidu-Sogou person entity alignment dataset.

Model | H@1 | H@10 | MR | MRR
Only face similarity comparison is used | 0.281 | 0.573 | 203.106 | 0.356
AliNet | 0.612 | 0.706 | 59.297 | 0.655
PEAMA (without aggregating multimodal information) | 0.623 | 0.714 | 59.093 | 0.661
PEAMA (only aggregating semantic feature) | 0.654 | 0.755 | 48.427 | 0.692
PEAMA (only aggregating face feature) | 0.711 | 0.811 | 34.996 | 0.751
PEAMA | 0.736 | 0.832 | 26.501 | 0.771
Table 4. Experimental effect of PEAMA on DBP15K and comparison with other entity alignment models.

Model | DBP15K ZH-EN (H@1 / MR / MRR) | DBP15K JA-EN (H@1 / MR / MRR) | DBP15K FR-EN (H@1 / MR / MRR)
GCN | 0.487 / - / 0.559 | 0.507 / - / 0.618 | 0.508 / - / 0.628
GAT | 0.418 / - / 0.508 | 0.446 / - / 0.537 | 0.442 / - / 0.546
R-GCN [16] | 0.463 / - / 0.564 | 0.471 / - / 0.571 | 0.469 / - / 0.570
AliNet | 0.547 / 282.760 / 0.628 | 0.549 / 362.110 / 0.633 | 0.556 / 276.225 / 0.644
PEAMA | 0.562 / 278.372 / 0.644 | 0.551 / 341.427 / 0.634 | 0.558 / 267.285 / 0.645
Table 5. Comparison of average training time per epoch.

Model | Average Time per Epoch
AliNet | 4.3329 s
PEAMA | 7.2569 s