3.2. Model Validation for the 15 Indexing Radicals
The identifiers were evaluated first for the validation samples that share the same font style as the training dataset. The confusion matrices with the validation results by all three models are presented in
Figure 7. The values in the main diagonal show the percentage of true positive (TP) detections per class, i.e., how many of the predictions of a given class actually belong to that class. Besides the true positive detections, the false positive (FP) and the false negative (FN) predictions were also relevant because both count misclassifications. Given TP, FP, and FN, the overall performance of the classifier model was evaluated in terms of the F-score, F = 2 × (precision × recall)/(precision + recall), where precision = TP/(TP + FP) and recall = TP/(TP + FN) [30].
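Given a confusion matrix of counts, the per-class metrics above follow directly from its rows, columns, and diagonal. The sketch below is illustrative only; the function name and the toy 3-class matrix are not from the paper.

```python
import numpy as np

def per_class_f_scores(cm):
    """Compute per-class F-scores from a confusion matrix.

    cm[i, j] = number of samples of actual class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)             # correct detections per class
    fp = cm.sum(axis=0) - tp     # predicted as the class, but actually another
    fn = cm.sum(axis=1) - tp     # samples of the class that were missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy 3-class confusion matrix (not the paper's data)
cm = [[9, 1, 0],
      [0, 8, 2],
      [1, 0, 9]]
f = per_class_f_scores(cm)
```

Averaging `f` over the classes gives the average F-score reported throughout this section.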
The regular-based and brush-based models could identify the radicals with average F-scores of ∼98.6% and ∼96.8%, respectively, while the handwriting-based model yielded an average F-score of ∼91.8%. These high values are expected, since the identifiers have seen both the radicals and the font styles contained in the validation set.
Comparing the confusion matrices in
Figure 7, one concludes that the regular-based and brush-based models presented the best performance for validation samples of the same font style as the one found in the training dataset. This behavior comes from the nature of the font style in each batch. Despite the particularities among different fonts, the regular category is upright and boxy, with clean strokes. The brush calligraphy, in turn, has an artistic flavor. Although some brush fonts are rigid with clean strokes, others are fluid and tend to join adjacent strokes, comparable to Western cursive calligraphy. The handwriting fonts behave similarly, except that the HuaKang handwriting has diverse scale factors within identical font size settings, leading to the drop in the true positive (TP) percentage compared to the other two models.
The wrong detection percentages are not explicit in
Figure 7, but one concludes from the scale bar that the models did not concentrate the FP and FN detections in any particular class, i.e., they did not misattribute the main visual features of an indexing radical to another. Indeed, a closer look showed that the false detections are evenly distributed outside the main diagonal, with values below 5%, and they are more noticeable in the handwriting model due to the calligraphy style and the inner scaling properties of the font itself. Even though the wrong classifications did not concentrate on a particular pair of actual-predicted classes, the classes “grass”, “wood”, “hand”, and “fire” were the most common outputs of the FP predictions. On the other hand, “fire”, “mouth”, and “earth” are the ones with more FN predictions. Both FP and FN misclassifications are reasonable when one analyzes the sub-lexical elements of the hanzi. For example, a significant number of characters have the block “wood” (木) as bùjiàn, although the indexing radical is another (to mention a few, 沐休呆慄操條茶). Another contributing factor for the higher FP and FN rates in the handwriting-based identifier is the individual styles the handwriting fonts simulate, creating sub-variations of the default forms of the indexing radical blocks.
Each model was then validated on datasets with a font style other than the one in its training dataset to evaluate its robustness to different calligraphy styles. For instance, the regular-based model was validated with the brush and handwriting datasets, and likewise for the other two models. The average F-score for all combinations of model and validation dataset is summarized in
Table 3.
Compared with the first validation step, there is a drop in classification performance when the validation and training datasets have different fonts, even though the radicals remained the same across the datasets. The loss, however, was steepest for the regular-based model classifying handwriting samples, with an average F-score of ∼30.9%. This behavior is mainly due to the diverse natures of the two font styles: while the regular font is upright and boxy, the handwriting style is artistic and ornate. Indeed, this diversity caused a similar drop for the handwriting-based model on the regular dataset, with an F-score of ∼41.8%.
The same drop, though subtler, is observed for the brush-based model on handwriting samples, with an F-score of ∼60.4%. The better performance of the brush-based model comes from the fact that brush calligraphy is not as sharply different from handwriting as the regular font is. While the regular font is always upright and boxy, brush writing can sometimes be fluid and ornate; hence, in a sense, it combines traits of the regular and handwriting styles. Consequently, the handwriting-based model could classify the brush samples relatively well, with an F-score of ∼71.7%.
Finally, the intermediate aspect of the brush calligraphy also explains why the regular-based model had better performance on brush samples than handwriting ones, reaching an average F-score of ∼69.4%. On the other hand, the brush-based model was somewhat robust to the regular font, with an average F-score of ∼83.0%.
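The cross-style comparison amounts to a sweep over every (training style, validation style) pair. The sketch below illustrates the structure of that sweep; a lookup table holding the average F-scores reported above stands in for actually running each trained network over a validation set.

```python
# Styles used in the cross-validation sweep
STYLES = ["regular", "brush", "handwriting"]

def cross_style_table(evaluate):
    """Build a {trained_on: {validated_on: avg_f_score}} table.

    `evaluate(train_style, val_style)` must return the average F-score of
    the model trained on `train_style` over the dataset of `val_style`.
    """
    return {train: {val: evaluate(train, val) for val in STYLES}
            for train in STYLES}

# Average F-scores reported in this section, used here in place of
# re-running the three identifiers over the three validation datasets.
reported = {
    ("regular", "regular"): 0.986, ("regular", "brush"): 0.694,
    ("regular", "handwriting"): 0.309,
    ("brush", "regular"): 0.830, ("brush", "brush"): 0.968,
    ("brush", "handwriting"): 0.604,
    ("handwriting", "regular"): 0.418, ("handwriting", "brush"): 0.717,
    ("handwriting", "handwriting"): 0.918,
}
table = cross_style_table(lambda t, v: reported[(t, v)])
```

The resulting nested dictionary mirrors the layout of Table 3, with the diagonal holding the same-style results of the first validation step.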
3.3. Performance in General Test
In the general test, the regular-based model processed the poems A–E, rendered in the PMingLiU font from the MingLiU family (
Figure 6). The font choice allows one to evaluate the proposed indexing radical identifier on a popular font style unseen during training, even though the previous validation step had already analyzed the three models on calligraphy groups different from the one in the training dataset. Another variable comes from the image processing operations used to extract the individual characters of the poems, which add noise and positional shifts to the input data. Moreover, since the PMingLiU font is regular and square, the test was run with the regular-based identifier alone.
The results for poem A (“The Quatrain of Seven Steps” by Cao Zhi) are shown in
Figure 8. Among the 30 hanzi in the poem, 20 have a radical among the 15 labeled classes. The correct classifications of these 20 characters are in blue, and the misclassifications are in red. As observed, the model correctly predicted 17 of the 20 known cases, representing an F-score of 85.0%. This behavior agrees with the values obtained in the validation stage, though there was a drop in performance. This loss is probably due to the hanzi extraction routine. While the validation samples in the previous step were individually generated with the same parameters as the training dataset, the hanzi in the poems are extracted as contents of a bigger image, under the assumption that every character fits inside a fixed-size square. Therefore, they are bound to suffer from offset, resolution loss, and cropping to fit them into the expected input size of the indexing radical identifier.
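The fixed-square assumption means each extracted crop must be padded and resampled before entering the network. A minimal, dependency-free sketch of such a step is shown below; the helper name, the 64 × 64 input size, and the white (255) background are assumptions for illustration, not the paper's actual parameters.

```python
import numpy as np

def fit_to_input(crop, size=64, pad_value=255):
    """Pad a grayscale character crop to a square, then resize it to the
    model input size with nearest-neighbor sampling.

    Hypothetical helper: `size` and `pad_value` are illustrative defaults.
    """
    h, w = crop.shape
    side = max(h, w)
    canvas = np.full((side, side), pad_value, dtype=crop.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = crop   # center the glyph
    rows = np.arange(size) * side // size       # nearest-neighbor row map
    cols = np.arange(size) * side // size       # nearest-neighbor column map
    return canvas[rows][:, cols]

glyph = np.zeros((40, 25), dtype=np.uint8)      # tall, off-square crop (ink = 0)
x = fit_to_input(glyph)
```

Resampling like this is exactly where the offset, resolution, and cropping artifacts mentioned above enter the pipeline.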
Aside from the characters with known indexing radicals, there are still ten hanzi in the poem whose indexing radicals are unknown to the classifier (in black). The actual class label of these characters is indicated with a question mark in
Figure 8. Since the model does not know these indexing radicals, the prediction cannot match the actual class, and the F-score dropped to 68.0%. It is nevertheless interesting to analyze the results of this scenario with some case studies, as shown in
Figure 9.
The leftmost column shows the image input as the model sees it, and the second column indicates the actual radical (highlighted in red). The examples in
Figure 9a,b coincide with the base form of their radicals, “bean” and “center”, respectively; therefore, the whole hanzi is marked in red. The examples in
Figure 9c,d are a combination of their radicals, “day” and “eye”, respectively, with other strokes. Since none of these four radicals are in the training set, the model could not identify them correctly. Instead, the identification was based on other morphological traits of the character, as indicated in red in the third column. First, both hanzi in
Figure 9a,b were assigned to the radical “mouth” because they both contain a square, which matches the base form of “mouth”, as indicated in the last column of
Figure 9 in blue. Likewise, the example in
Figure 9c was classified based on the second form of the radical “fire” (shown in blue). Lastly, the hanzi in
Figure 9d was misclassified as the radical “wood” due to its composition. The block on the left side is “wood”, while the block on the right side is “eye”. Since the model does not recognize the radical “eye”, it naturally concluded that the character belonged to the radical “wood” instead.
Finally, the same analysis is repeated for poems B–E.
Table 4 summarizes the total of characters and the number of known radical hanzi in the poems. For each poem, the F-score was calculated considering only the known radicals first and then all the characters, as shown in
Table 5.
Disregarding the unknown radicals, the model had an average F-score of 86.0% across the five poems, which is consistent with, though below, the ∼98.6% obtained during the validation step. The drop from ∼98.6% to 86.0% can be explained by several factors. For one, the indexing radical identifier has to deal with the change in font family. Although the PMingLiU font is still regular and similar to one of the HuaKang fonts in the training dataset, it comes from a different family and has other features, such as different lengths, thicknesses, and angles for the strokes. For another, while the images in the training dataset have high resolution, the poems had an average image resolution, chosen to evaluate the model’s robustness to noise, which is unavoidable in practical applications. Lastly, the training and validation samples were individually generated with well-controlled parameters, a control the general test deliberately forgoes, precisely to evaluate the indexing radical identifier under more natural circumstances.
Analyzing the identifier performance over all the characters inside the poems, the model reached an average F-score of 61.4%. This performance loss is expected due to the unknown radicals. A solution to this problem is to expand the training dataset to cover more radicals, or to create an extra class (e.g., “others”) encompassing the cases wherein the prediction probability for the radicals with known samples is low. Thus, instead of correlating the hanzi to a known class, the model would mark it as “others”.
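The “others” strategy amounts to rejecting low-confidence predictions at inference time. A minimal sketch of such a rejection rule is given below; the function name, the class names, and the 0.5 threshold are illustrative assumptions, not values tuned in this work.

```python
import numpy as np

def predict_with_rejection(probs, classes, threshold=0.5):
    """Map a softmax output to a class label, or to "others" when the
    model is not confident enough.

    `threshold=0.5` is an illustrative value; in practice it would be
    tuned on validation data.
    """
    probs = np.asarray(probs)
    best = int(np.argmax(probs))
    return classes[best] if probs[best] >= threshold else "others"

classes = ["water", "mouth", "wood"]
confident = predict_with_rejection([0.70, 0.20, 0.10], classes)  # → "water"
uncertain = predict_with_rejection([0.40, 0.35, 0.25], classes)  # → "others"
```

Unlike retraining with an explicit “others” class, this rule requires no new data, at the cost of also rejecting some correct but low-confidence predictions of the 15 known radicals.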