1. Introduction
Learning calligraphy embodies an understanding and continuation of traditional Chinese culture. However, resources for calligraphy education remain limited, and the shortage of specialized calligraphy instructors is a pressing problem. Calligraphy evaluation can spark enthusiasm for calligraphy learning, yet it faces challenges such as a scarcity of renowned mentors, low evaluation efficiency, and inconsistent evaluation criteria. Driven by the rapid advancement of computer and information technology, traditional calligraphy is being digitized in its creative tools, artistic effects, writing techniques, and creative methodologies, giving rise to the interdisciplinary field of digital calligraphy research. Digital calligraphy research comprises three main facets: (1) digitized modeling of calligraphy tools; (2) analysis and processing of calligraphy images; and (3) representation and synthesis of calligraphic characters. These three directions share a common concern: the computational aesthetics and evaluation of calligraphy. Intelligent aesthetic evaluation is a challenging problem in artificial intelligence. Current efforts focus primarily on aesthetic ratings of photographic images [1], aesthetic evaluation of painted artwork [2], aesthetic evaluation of videos [3], and aesthetic evaluation of web page layout design [4]. However, research on computational aesthetics and evaluation of calligraphy remains relatively scarce. Existing work falls into two main directions: calligraphy evaluation based on feature engineering and calligraphy evaluation based on deep learning.
Feature-engineering approaches to calligraphy evaluation commonly combine manually extracted features with classifiers. Common hand-crafted features, grounded in traditional visual theories, include SIFT [5], HOG [6], and Gabor [7], among others. Features extracted from images are then classified with standard machine learning methods such as SVM [8], RF [9], KNN [10], XGBoost [11], and LightGBM [12]. Han et al. [13] introduced an interactive calligraphy guidance system that uses image-processing techniques to extract quantifiable features such as the center, size, and projection of each handwritten character; these features are then fed into fuzzy inference to evaluate the handwritten characters. Gao et al. [14] designed eight directional features to implement Chinese handwriting quality evaluation by analyzing confidence in online Chinese handwriting recognition. Li et al. [15] represented the topology of Chinese characters using WF histograms and employed these topological features as inputs for calligraphy evaluation via an AdaBoost ensemble composed of support vector regression (SVR) models. Drawing inspiration from classical calligraphy principles, Sun et al. [16] proposed 22 global shape features; together with a 10-dimensional sparse-coding feature vector capturing component layout information, these features are concatenated and input into an artificial neural network for calligraphy evaluation. Wang et al. [17] introduced a hierarchical evaluation approach covering both entire characters and individual strokes, computing weighted similarities and transforming them into final evaluation scores. In subsequent work, the same team employed a skeleton-based iterative closest point algorithm to combine whole-character and stroke-level similarities into a final comprehensive score [18]. Zhou et al. [19] devised three feature criteria based on Chinese calligraphy theory for extracting features of Chinese characters from calligraphy teaching materials, then applied a possibility–probability distribution method for evaluation. Addressing challenges inherent in resistive touch-screen handwriting, Xing [20] introduced a fuzzy evaluation method for Chinese character strokes that constructs membership templates, selects fuzzy subsets, and generates template parameters. While hand-designed features perform well on a range of evaluation tasks, they are constrained by limited feature quantity and representation scope, which often causes nuanced evaluation information to be lost. Furthermore, the intricacy of manual feature design leads to substantial errors and limited applicability, constraining the effectiveness of evaluation models.
Deep learning methods have achieved significant progress in image aesthetic quality evaluation. For instance, for aesthetic quality evaluation of photographs in the AVA dataset, Zhang et al. [21] devised a multi-modal self-cooperative attention network (MSCAN), while Li et al. [22] used a deep Siamese network in a multi-task deep learning framework for personalized and general image aesthetic evaluation. Owing to the lack of public datasets for calligraphy evaluation, deep learning research on calligraphy evaluation lags behind general image aesthetic quality evaluation. In this field, Qiang et al. [23] proposed a CNN-based method in 2019 for categorizing the aesthetic quality of students' hard-pen calligraphy, achieving commendable accuracy; however, a continuous, fine-grained aesthetic scoring system, along with interpretability of the scores, remains lacking. In 2022, Sun et al. [24] introduced a calligraphy aesthetic evaluation framework combining a Siamese network architecture with transfer learning, focusing primarily on brush calligraphy. In these deep learning-based methods, evaluation features are extracted mainly from calligraphy images, overlooking the influence of brushstroke dynamics during creation. Xu et al. [25] employed a TLD tracking algorithm to extract brush movement trajectories from video streams and proposed an MCNN-LSTM model that scores handwriting quality by comparison with reference templates. Similarly, Wang et al. [26] collected sequential brushstroke data using a nine-axis sensor and combined long short-term memory networks with the K-nearest neighbor algorithm to evaluate calligraphy from writing motion data. Deep learning methods can enhance the efficiency, robustness, and reliability of automated calligraphy evaluation. However, most existing research focuses on brush calligraphy, with limited exploration of hard-pen calligraphy evaluation.
As the times have developed, the hard pen has become the primary writing instrument of modern work and life, and hard-pen calligraphy has emerged as a branch of traditional calligraphy. Distinguished from brush calligraphy by its writing tool, hard-pen calligraphy exhibits distinct aesthetic characteristics: it readily highlights the beauty of character form and structure but is relatively limited in stroke formation and expressive penmanship. Building on this attribute, this paper integrates Chinese character recognition and similarity calculation, applying deep learning models to automate hard-pen calligraphy evaluation. The overall structure of the evaluation model is depicted in Figure 1.
The main contributions of this paper are as follows:
- (1)
Establishment of a dataset specifically designed for evaluating hard-pen Chinese character writing. Key features of the dataset include: 100 volunteers transcribing six randomly selected Chinese characters to reduce random errors; collection of handwriting images from volunteers with different styles to ensure sample diversity; application of appropriate preprocessing techniques to enhance image quality; and annotation of bounding boxes to enrich the dataset. The dataset caters to the experimental requirements of hard-pen calligraphy evaluation models.
- (2)
Utilization of YOLOv5 with a streamlined CSPDarknet53 as the backbone feature extraction network enabled precise detection and recognition of Chinese characters, aligning each character with its corresponding standard template. This robust foundation significantly enhances the subsequent evaluation processes.
- (3)
Adoption of a Siamese network architecture as the central framework, complemented by an improved VGG16 as the primary feature extraction network. The extracted feature maps are then fed into a Transformer structure to capture interrelationships, thereby enhancing the model’s feature extraction capabilities. The incorporation of similarity metrics enables the realization of hard-pen calligraphy evaluation.
2. Proposed Dataset
Currently, deep learning-based network models rely primarily on data-driven approaches. Existing publicly available Chinese character datasets are mainly intended for offline or online handwritten Chinese character recognition. The HCL2000 dataset [27], an offline handwritten simplified Chinese character dataset, covers 3755 frequently used first-level characters, with handwriting samples contributed by 1000 participants of various ages, genders, educational backgrounds, and occupations. Another dataset, HCD [28], is also an offline first-level Chinese character dataset; it is divided into ten subsets by handwriting quality and contains samples from 1999 participants, roughly double the size of HCL2000. The CASIA-HWDB dataset [29] includes samples of handwritten Chinese characters and texts, comprising 7185 Chinese characters and 171 English letters, digits, and symbols gathered from 1020 individuals, and is applicable to both offline and online handwritten Chinese character recognition. However, no publicly available dataset is specifically designed for evaluating hard-pen writing, necessitating the creation of such a dataset. To facilitate network training and improve model accuracy, the dataset must be annotated; moreover, to improve the generalization of detection and recognition models, data augmentation is required.
2.1. Data Acquisition and Preprocessing
A total of six Chinese characters were randomly selected from the first-level Chinese character database, and 100 individuals were invited to write them. The chosen characters were “爱” (love), “你” (you), “中” (middle), “华” (China), “天” (sky), and “下” (under), yielding 100 writing samples per character. To better suit mobile application scenarios, images were captured with smartphones, producing 600 photographs saved in JPEG format. During capture, insufficient resolution left the outlines of the character strokes unclear, degrading the features extracted by the evaluation model and thus its evaluation performance; conversely, excessively high resolution did not significantly improve performance but incurred substantial computational cost. Following testing, the optimal resolution for the captured sample images was determined to be 880 × 1230. To simulate real-world application scenarios, lighting conditions during capture were not standardized; however, variations in lighting could affect how the Chinese characters and background appear in the samples. Preprocessing of the collected sample images was therefore necessary to ensure effective model evaluation.
Preprocessing serves two objectives. First, it enables data augmentation, creating a dataset with diverse characteristics that simulates real-world influences; this ensures sample diversity and ultimately enhances the model's generalization. To this end, several types of noise, including random noise, Gaussian noise, and salt-and-pepper noise, are intentionally added to the collected dataset. Second, preprocessing mitigates the adverse effects of environmental conditions, improving training accuracy. To facilitate subsequent recognition and processing, thresholding is used to binarize the Chinese character images, removing interference from lighting variations and shadows. The thresholding techniques considered are single (global) thresholding, adaptive thresholding, and Otsu thresholding [30], as depicted in Figure 2. Among them, Otsu thresholding yields suboptimal results, primarily due to large shadow regions that significantly impact character evaluation. Both single and adaptive thresholding perform satisfactorily, with single thresholding offering a notable speed advantage. To optimize preprocessing speed, single thresholding with a segmentation threshold of 50 is selected to binarize the sample images. Erosion and dilation operations are then applied to eliminate residual noise left after thresholding. To avoid computational errors caused by variations in character size, excess background is cropped and the images are standardized to a uniform size. The outcomes of the preprocessing procedures are depicted in Figure 3.
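The binarize-then-crop steps above can be sketched in a few lines. This is a plain-Python illustration on a toy grayscale grid (a real pipeline would use OpenCV's thresholding and morphology routines); the threshold of 50 follows the paper, while the toy image values are invented for the example:

```python
# Sketch of the preprocessing step: binarize with a global threshold of 50,
# then crop away excess background around the character.
# A plain nested-list "image" keeps the example self-contained.

THRESHOLD = 50  # segmentation threshold chosen in the paper

def binarize(img, thresh=THRESHOLD):
    """Map dark (ink) pixels <= thresh to 1 and brighter paper pixels to 0."""
    return [[1 if px <= thresh else 0 for px in row] for row in img]

def crop_to_content(mask):
    """Crop the binary mask to the bounding box of the ink pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:          # blank image: nothing to crop
        return mask
    r0, r1, c0, c1 = min(rows), max(rows), min(cols), max(cols)
    return [row[c0:c1 + 1] for row in mask[r0:r1 + 1]]

# Toy 5x6 grayscale "photo": dark stroke pixels (<50) on bright paper (~200).
photo = [
    [200, 210, 205, 199, 220, 215],
    [201,  30,  25, 204, 218, 210],
    [198,  28,  22,  35, 219, 212],
    [202, 205,  27, 201, 217, 208],
    [200, 207, 206, 203, 221, 209],
]
mask = binarize(photo)
cropped = crop_to_content(mask)
```

Here `cropped` is the 3 × 3 region containing the stroke pixels; in the actual dataset the crop is subsequently resized to a uniform resolution.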
2.2. Data Annotation
In accordance with the training requirements of the Chinese character detection and recognition model, the LabelImg software (version 1.8.1) was employed to perform category bounding box annotation on the sample images. The chosen format for annotation boxes was YOLO-style, involving the utilization of rectangular boxes to encompass the target objects. After selecting these boxes, the specific target category was assigned, leading to the creation of a corresponding label file. These label files followed a text document format, containing relevant information such as category indices, central coordinates, and relative dimensions of the annotated boxes.
Figure 4 visually depicts the process of data annotation. The Chinese character samples were organized in the “images” directory, with the corresponding label files stored in the “labels” directory. This arrangement resulted in a total of 4800 samples paired with 4800 corresponding labels. Additionally, to facilitate the training of the Siamese network within the calligraphy evaluation model, distinct categories of Chinese characters were segregated into individual folders. During the training process, a random subset of 20 samples from each character category was selected from the collected dataset.
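The YOLO-style label files described above store, per line, a class index followed by normalized center coordinates and box dimensions. As a sketch, decoding one such line back into pixel coordinates (the specific class index and coordinate values below are hypothetical) looks like this:

```python
# Sketch: decode one YOLO-format label line into a pixel-space bounding box.
# Format per line: "<class_idx> <cx> <cy> <w> <h>", with all four
# coordinates normalized to [0, 1] relative to the image size.

def yolo_to_pixels(line, img_w, img_h):
    cls, cx, cy, w, h = line.split()
    cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
    x0 = (cx - w / 2) * img_w   # left edge
    y0 = (cy - h / 2) * img_h   # top edge
    x1 = (cx + w / 2) * img_w   # right edge
    y1 = (cy + h / 2) * img_h   # bottom edge
    return int(cls), (round(x0), round(y0), round(x1), round(y1))

# Hypothetical label for one character centered in an 880 x 1230 image.
label_line = "2 0.5 0.5 0.25 0.2"
cls_idx, box = yolo_to_pixels(label_line, img_w=880, img_h=1230)
```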
3. Proposed Methodology
This paper introduces an intelligent calligraphy evaluation approach based on deep learning, with the overall process outlined in Figure 5. First, the evaluation dataset is established through data collection and preprocessing. Next, the YOLOv5 object detection network detects and recognizes the input evaluation samples, yielding standardized reference samples for the evaluation candidates. A Siamese model is then constructed with VGG16 as the backbone network; upsampling operations enhance feature extraction, and a Transformer structure captures contextual information from the evaluation samples, strengthening both feature extraction and propagation. This process is aimed at optimizing the model's evaluation outcomes.
3.1. Detection and Recognition Network
In this study, YOLOv5 is employed as the detection and recognition network to perform region detection and recognition on the target evaluation samples, thereby obtaining standardized samples for evaluation. YOLOv5 is a deep learning algorithm for object detection [31] that offers significant improvements in both detection speed and recognition accuracy over YOLOv4. The model consists of three components: backbone, neck, and head, as illustrated in Figure 6. The backbone employs structures such as the Conv module, C3 module, and Spatial Pyramid Pooling (SPP) module to process the input image through a series of convolutional and pooling layers, progressively extracting high-level feature representations. The neck uses convolutions at different scales, upsampling, and downsampling from the Feature Pyramid Network (FPN) to fuse feature maps from different levels, producing multi-scale representations for a more comprehensive and enriched feature expression; these fused feature maps are then forwarded to the prediction layer. The head, responsible for object detection within the feature pyramid, comprises convolutional, pooling, and fully connected layers. In YOLOv5, the detection head performs multi-scale object detection on the feature maps extracted by the backbone, ultimately accomplishing the handwritten Chinese character detection and recognition tasks.
3.2. Siamese Evaluation Network
The calligraphy evaluation network is built upon the Siamese network architecture [32], which incorporates an enhanced VGG16 [33] as the backbone feature extraction network. The feature maps extracted by VGG16 are input into a Transformer structure to capture interrelationships, enhancing the model's generalization capacity. The Siamese network comprises two input channels, with identical weights and parameters across all structures within these channels. After features are extracted from the two input samples, a feature vector is obtained for each, and the similarity between the samples is evaluated by measuring the distance between the two vectors. Because the Siamese network can quantify the dissimilarity between two Chinese characters, it is a suitable foundational framework for the evaluation model.
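The shared-weight principle can be illustrated with a minimal sketch. The linear "feature extractor" and its weights below are toy stand-ins for the enhanced VGG16 branches of the actual model; the point is that one set of weights embeds both inputs, and similarity is the distance between the resulting vectors:

```python
import math

# Sketch of the Siamese principle: ONE set of weights embeds both inputs,
# and the evaluation score derives from the distance between embeddings.
# The 2x3 weight matrix is a toy stand-in for the shared VGG16 branch.

W = [[0.5, -0.2, 0.1],
     [0.3,  0.4, -0.6]]  # shared weights used for BOTH inputs

def embed(x):
    """Toy feature extractor: a single linear layer with shared weights W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

sample = [1.0, 2.0, 3.0]     # candidate handwriting features (toy values)
template = [1.1, 2.0, 2.9]   # standard-template features (toy values)
other = [5.0, -1.0, 0.0]     # a very different sample

d_close = distance(embed(sample), embed(template))
d_far = distance(embed(sample), embed(other))
```

A small distance (`d_close`) indicates the candidate resembles the standard template; a large one (`d_far`) indicates dissimilarity.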
3.3. Backbone Feature Extraction Network
The calligraphy evaluation model adopts the Siamese network as the overall structural framework, with VGG16 as the backbone feature extraction network within the Siamese structure. Input images of size 256 × 256 × 3 are fed into VGG16 for hierarchical feature extraction. Through multiple convolution and pooling operations, with the ReLU activation applied after each convolution to strengthen inter-layer connections, image features are extracted, ultimately yielding a feature matrix of size 4 × 4 × 512. Stacks of small convolutional kernels are a key element of VGG16 [34]. To address the loss of accuracy caused by the small feature maps that result from repeated convolutions, the VGG16 network is enhanced: a combination of upsampling and pointwise (1 × 1) convolution is applied to the VGG16 output to enlarge the feature maps and adjust their channel dimensions. The improved structure is illustrated in Figure 7. Upsampling employs bilinear interpolation [35], which uses an interpolation algorithm to compute the new values inserted between the existing matrix elements.
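The upsampling-plus-pointwise-convolution enhancement can be sketched as follows. This is a pure-Python, align-corners bilinear interpolation on tiny toy maps (the real model operates on 4 × 4 × 512 feature tensors, and the 1 × 1 mixing weights here are invented):

```python
# Sketch of the VGG16 enhancement: bilinear upsampling enlarges a small
# feature map, and a pointwise (1x1) convolution then remixes channels.

def bilinear_upsample(fm, out_h, out_w):
    """Align-corners bilinear interpolation of a 2-D feature map."""
    in_h, in_w = len(fm), len(fm[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1)       # source row coordinate
        y0, ty = int(y), y - int(y)
        y1 = min(y0 + 1, in_h - 1)
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1)   # source column coordinate
            x0, tx = int(x), x - int(x)
            x1 = min(x0 + 1, in_w - 1)
            top = fm[y0][x0] * (1 - tx) + fm[y0][x1] * tx
            bot = fm[y1][x0] * (1 - tx) + fm[y1][x1] * tx
            row.append(top * (1 - ty) + bot * ty)
        out.append(row)
    return out

def pointwise_conv(channels, weights):
    """1x1 convolution: a per-pixel linear mix of the input channels."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[[sum(wc * ch[i][j] for wc, ch in zip(wrow, channels))
              for j in range(w)] for i in range(h)] for wrow in weights]

fm = [[0.0, 3.0],
      [6.0, 9.0]]                               # tiny 2x2 feature map
up = bilinear_upsample(fm, 4, 4)                # -> 4x4 map
mixed = pointwise_conv([up, up], [[0.5, 0.5]])  # 2 channels -> 1 channel
```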
3.4. Enhanced Feature Extraction Network
After feature extraction by the backbone network, taking global image information into account can improve the model's performance and processing efficiency. Therefore, the Transformer structure [36] is incorporated into the model framework. Since the backbone outputs two feature maps while the Transformer operates on tokens, the feature maps must be transformed into token structures carrying high-dimensional semantic information. To represent the differing parts between the two input images with compact tokens, the feature maps produced by VGG16 are converted into tokens by a Semantic Tokenizer, which extracts compact semantic tokens from each temporal feature map. The processing procedure of the Semantic Tokenizer is illustrated in Figure 8. The feature maps are first divided into multiple submaps, each mapped to a token. To incorporate spatial information, spatial attention compresses the channels and extracts spatial information, producing compact tokens enriched with semantic information.
To obtain concise tokens, a set of spatial attention maps is learned by the Tokenizer, concentrating the spatial feature map into a set of tokens. Let X1, X2 ∈ R^(HW×C) denote the input dual-temporal feature maps, where H, W, and C are the height, width, and channel dimensions of the feature maps, and let T1, T2 ∈ R^(L×C) denote the two token sets, with L the size of the token vocabulary. The feature map of size C × W × H is compressed into a 1 × W × H map, aggregating average and maximum information along the channel dimension; this aggregated information feeds the convolutional layer that extracts the attention information from the image. This process yields L semantic groups, each representing a distinct semantic concept. A softmax is then applied along the HW dimension of each semantic group, ensuring non-negative attention weights and producing the spatial attention maps. Finally, the attention maps are used to compute a weighted average of the pixels in Xi, yielding a compact vocabulary of size L: the semantic tokens Ti.
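The tokenization step above can be sketched with toy dimensions. The projection matrix `W_att` below is a hypothetical learned weight standing in for the attention convolution; the essential mechanics are the softmax over the HW spatial positions and the attention-weighted pixel average:

```python
import math

# Sketch of the Semantic Tokenizer: pixel features X (HW x C) are projected
# to L semantic groups, softmax-normalized over the HW positions, and each
# token becomes the attention-weighted average of the pixels.
# Toy sizes: HW = 4 positions, C = 2 channels, L = 3 tokens.

X = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0],
     [1.0, 1.0]]                      # HW x C feature map (toy values)

W_att = [[0.5, -0.3, 0.2],
         [-0.1, 0.4, 0.6]]            # hypothetical learned C x L projection

HW, C, L = len(X), len(X[0]), len(W_att[0])

# 1) attention logits: HW x L
logits = [[sum(X[p][c] * W_att[c][l] for c in range(C)) for l in range(L)]
          for p in range(HW)]

# 2) softmax over the HW dimension, separately for each semantic group l
attn = [[0.0] * L for _ in range(HW)]
for l in range(L):
    col = [logits[p][l] for p in range(HW)]
    m = max(col)                       # stabilized softmax
    exps = [math.exp(v - m) for v in col]
    s = sum(exps)
    for p in range(HW):
        attn[p][l] = exps[p] / s

# 3) tokens: attention-weighted average of the pixels -> L x C
tokens = [[sum(attn[p][l] * X[p][c] for p in range(HW)) for c in range(C)]
          for l in range(L)]
```

Each token is a convex combination of pixel features, so the L × C token set compactly summarizes the HW × C map.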
3.5. Encoder
As shown in Figure 9, the Transformer encoder consists of multiple layers, each composed of a multi-head self-attention mechanism and a feedforward neural network. The input to this structure must be a sequence: each time step is processed by the CNN encoder and enriched with positional encoding before being fed into the multi-layer Transformer encoder. The self-attention mechanism captures the interdependencies between components of the input sequence by computing self-attention at each time step. In the feedforward network, each time step is transformed through two linear transformations and an activation function. After passing through the encoder, the sequence acquires its feature representation, which the decoder then uses to generate the target image sequence.
The processed tokens Ti (i = 1, 2) from the Semantic Tokenizer module are concatenated, yielding the input token sequence X = {x_1, x_2, …, x_2L}, where x_t denotes the feature token at time step t. Positional encoding adds a d-dimensional vector p_t to each time step t so that the model can discern the distances between time steps. This vector is defined by Equation (1):

p_(i, 2j) = sin(i / 10000^(2j/d)),  p_(i, 2j+1) = cos(i / 10000^(2j/d)),  (1)

where i represents the position in the sequence, j denotes the dimension index in the vector, and d is the total dimension of the vector. Subsequently, x_t is added to p_t, giving z_t, and the resulting matrix Z ∈ R^(2L×C) is then normalized.
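Equation (1) is the standard sinusoidal positional encoding of [36]: even vector dimensions use sine, odd dimensions use cosine, with wavelengths forming a geometric progression governed by d. A direct sketch:

```python
import math

# Sketch of the sinusoidal positional encoding of Equation (1):
# pe[i][2j]   = sin(i / 10000**(2j/d))
# pe[i][2j+1] = cos(i / 10000**(2j/d))

def positional_encoding(seq_len, d):
    pe = [[0.0] * d for _ in range(seq_len)]
    for i in range(seq_len):          # position in the sequence
        for j in range(0, d, 2):      # paired (sin, cos) dimensions
            angle = i / (10000 ** (j / d))
            pe[i][j] = math.sin(angle)
            if j + 1 < d:
                pe[i][j + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=8, d=4)
```

Because each position maps to a unique pattern of phases, the model can infer relative distances between time steps from these vectors alone.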
To enhance the processing of multi-modal data in this study, we employ a normalization scheme referred to as “Norm”, which consists of two components: Instance Normalization (IN) and Layer Normalization (LN). IN normalizes individual modalities and operates primarily on the channel dimension of a single modality's data, whereas LN normalizes multi-modal data by normalizing the feature vector at each position of each modality within every sample. The normalization formula for IN is given in Equation (2) [37]:

IN(x) = γ · (x − μ) / σ + β,  (2)

where μ is the mean along the channel dimension, σ is the standard deviation along that dimension, and γ and β are the learnable scaling and shifting factors, respectively, for that dimension.
The normalization formula of LN is shown in Equation (3) [38]:

LN(x) = γ · (x − μ) / √(σ² + ε) + β,  (3)

where the input vectors comprise multiple batches and ε is a small constant (typically 10^−5) that guards against division by zero. Following normalization, the vectors are projected into three distinct vector spaces, Query, Key, and Value, through three parameter matrices W^Q, W^K, and W^V, each belonging to R^(d×L), as shown in Equation (4):

Q = Z·W^Q,  K = Z·W^K,  V = Z·W^V.  (4)
The corresponding attention scores are computed from the scaled dot product between Q and K, as shown in Equation (5):

Attention(Q, K, V) = softmax(Q·K^T / √d)·V,  (5)

where the softmax applies exp(·) to the inner products between the query and key vectors. When calculating attention scores, weights are first determined from the similarity between the query and key vectors; these weights are then applied in an element-wise multiplication and summation over all values. Multiple attention heads are connected in parallel to form a multi-head attention mechanism. The positionally encoded input is combined with the output of the multi-head attention mechanism; after normalization, this combined result is passed through a multi-layer perceptron (MLP) in the feedforward network.
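A compact sketch of the scaled dot-product attention computed by each head, with toy 2-token matrices (the Q, K, V values below are invented for illustration):

```python
import math

# Sketch of one attention head: scores = softmax(Q K^T / sqrt(d)), and the
# output is the score-weighted sum of the value vectors.

def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)                       # stabilized softmax
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def attention(Q, K, V):
    d = len(Q[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d)
    scores = [[sum(q * k for q, k in zip(qrow, krow)) / math.sqrt(d)
               for krow in K] for qrow in Q]
    weights = softmax_rows(scores)
    # output row i = weighted sum of V's rows
    out = [[sum(w * v for w, v in zip(wrow, col)) for col in zip(*V)]
           for wrow in weights]
    return out, weights

Q = [[1.0, 0.0], [0.0, 1.0]]   # toy queries
K = [[1.0, 0.0], [0.0, 1.0]]   # toy keys
V = [[10.0, 0.0], [0.0, 10.0]] # toy values
out, weights = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, with more weight on the key most similar to the query.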
The role of the MLP is to apply a non-linear transformation to the input attention vectors, enhancing the model's representational capacity; in this study, the module also mediates the transition between the self-attention module and the cross-modal fusion module. It performs the following operation on the output of the multi-head attention mechanism:

MLP(Z) = σ(Z·W_1 + b_1)·W_2 + b_2,  (6)

where W_1 ∈ R^(C×2C) and W_2 ∈ R^(2C×C) are two parameter matrices, b_1 ∈ R^(2C) and b_2 ∈ R^C are two bias vectors, and σ(·) denotes the activation function. The MLP transformation yields a richer, more intricate representation, improving the model's performance. After processing by multiple encoder modules, a new set of feature tokens Tnew_i (i = 1, 2) is derived.
3.6. Decoder
The architecture of the decoder, depicted in Figure 10, is primarily responsible for generating the output sequence, mapping the token sequence back into pixel-level space. The decoder receives a temporally informed feature as input and generates the output sequence progressively using a self-attention mechanism. In this study, the decoding stage incorporates a Bitemporal Attention mechanism, which extends attention across the two time steps so that the model can simultaneously account for the relationships between the two temporal instances.

Given a sequence of features X_i, the decoder leverages the relationship between each pixel and Tnew_i to derive refined features Xnew_i. The first step employs the self-attention mechanism to learn representations for each temporal instance, as given in Equations (7) and (8):

head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V),  (7)
MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W^O.  (8)
In these two expressions, head_i denotes the output of the i-th attention head, W_i^Q, W_i^K, and W_i^V are the parameter matrices of the i-th head, and W^O is the output matrix. The results are then fed into the Bitemporal Attention mechanism to incorporate information across the two time steps, as expressed in Equation (9).
Following the attention mechanism, the data are directed into a feedforward neural network for nonlinear processing, ultimately yielding the refined features Xnew_i.
3.7. The Prediction Network
Two new pixel-level feature maps are derived through the preceding two stages, and the prediction network produces a differential prediction map through a sequence of operations. The absolute difference of the two feature maps, as given in Equation (10), corresponds to the distinctive region between them:

D = |Xnew_1 − Xnew_2|.  (10)

The result is input into the change classifier, which consists of two 3 × 3 convolutional layers with batch normalization (BN). After the softmax operation, the image dimensions become 2 × 64 × 64. Finally, the similarity between the two Chinese character handwriting samples is inferred from the output of the fully connected layer. The disparity between predicted outcomes and actual labels is measured with the binary cross-entropy loss, expressed in Equation (11):

L = −(1/N) · Σ_(i=1..N) [ y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ],  (11)
where N is the total number of training samples, y_i is the actual label of the i-th sample, and ŷ_i is the predicted outcome for the i-th sample. Within this loss function, when y_i = 1, only the term for the positive-class prediction contributes; conversely, when y_i = 0, only the negative-class term is counted. Throughout training, the model's evaluation of Chinese character handwriting samples is optimized by progressively reducing this loss.
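The binary cross-entropy loss described above can be sketched directly; the label/prediction values below are toy examples, with predictions clipped away from 0 and 1 for numerical safety:

```python
import math

# Sketch of the binary cross-entropy loss: the mean of
# -[y*log(p) + (1-y)*log(1-p)] over N samples.

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Toy labels: 1 = "similar to the standard template", 0 = "dissimilar".
labels = [1, 0, 1, 0]
good_preds = [0.9, 0.1, 0.8, 0.2]   # confident, mostly correct
bad_preds = [0.4, 0.6, 0.5, 0.5]    # uncertain or wrong

loss_good = binary_cross_entropy(labels, good_preds)
loss_bad = binary_cross_entropy(labels, bad_preds)
```

Lower loss corresponds to predictions that agree confidently with the labels, which is exactly the quantity the training loop minimizes.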
5. Conclusions
The proposed intelligent calligraphy evaluation model, based on deep learning with the Siamese structure as its fundamental framework, achieves the evaluation of hard-pen writing by calculating the similarity between the samples under evaluation and standard samples. The model utilizes an enhanced VGG16 as the backbone feature extraction network and introduces a Transformer structure to enhance the learning of interrelationships between feature maps, thereby improving the model’s performance. Experimental results demonstrate that the Siamese structure is effective for the comparative evaluation of writing samples in hard-pen calligraphy. Additionally, the Transformer structure significantly enhances the evaluation model’s performance.
Considering the challenges posed by the weak representation of stroke forms and brushwork in hard-pen writing, the model assesses the quality of hard-pen Chinese character writing based on the overall feature similarity. This approach provides a novel method for the automatic evaluation of hard-pen calligraphy. Specifically designed for hard-pen calligraphy assessment, the model acquires its feature extraction capability through learning from the hard-pen calligraphy evaluation dataset, demonstrating strong specificity and adaptability. However, the model is subject to constraints, including the size of the training dataset, which is relatively small, and the limited diversity of the evaluation features.
Future research could address these limitations in two ways: first, by expanding the scale of the handwriting evaluation dataset, increasing the number of Chinese characters covered, and incorporating standard writing samples from renowned hard-pen calligraphers; and second, by integrating dynamic data from the hard-pen writing process as an additional feature dimension in the similarity evaluation, enriching the features available for hard-pen calligraphy evaluation and better reflecting individual differences in the writing process.