5.1. Datasets
We conduct experiments on two datasets, SemTab2019 [34] and HardTables2022 [35], which include annotations for both column types and inter-column relationships. Both datasets contain vertical relational web tables from Wikipedia with semantic annotations at various levels of detail. The average number of rows in tables from HardTables2022 is much smaller than that in SemTab2019, making annotation more challenging.
Table 2 presents a summary of the basic statistics of the two datasets. Due to the absence of a manually annotated dataset for the column NER task, we use the NER annotation tool spaCy [36] to identify the entity types of all cells in each column, assigning the most frequent entity type as the column entity type. In accordance with prior research practices, all table headers are excluded. With no overlap among tables, 10% of the samples are randomly chosen as the test set, while the remaining data are divided into five folds for cross-validation.
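This annotation procedure can be sketched as follows. The sketch is an illustrative reconstruction, assuming each column arrives as a list of cell strings; the choice of spaCy model (`en_core_web_sm`) and the fallback label for entity-free columns are our assumptions, as the paper does not specify them.

```python
# Minimal sketch of column-level NER annotation: tag every cell with spaCy
# and assign the most frequent entity type as the column's entity type.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model choice; not given in the paper

def column_entity_type(cells: list[str]) -> str:
    """Return the most frequent spaCy entity label across a column's cells."""
    labels = []
    for cell in cells:
        doc = nlp(cell)
        labels.extend(ent.label_ for ent in doc.ents)
    # Fall back to "O" (no entity) when spaCy finds no entities in the column;
    # this fallback is our assumption.
    return Counter(labels).most_common(1)[0][0] if labels else "O"

# Example: a column of city names is typically tagged GPE by spaCy.
print(column_entity_type(["Berlin", "Tokyo", "Oslo"]))
```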
To evaluate model performance, we employ two types of F1 score metrics: the micro-average F1 score and the macro-average F1 score. The micro-average F1 score considers all instances across classes to compute an overall average, emphasizing performance on common classes. Conversely, the macro-average F1 score treats all classes equally by averaging the F1 scores of each class, providing a balanced view of performance across all classes regardless of their frequencies.
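The distinction matters for skewed class distributions. A small worked example with scikit-learn (the labels are invented for illustration): when a rare class is misclassified, macro F1 drops sharply while micro F1 barely moves.

```python
# Micro vs. macro F1 on a toy label set. The single "location" instance is
# mispredicted, which macro F1 penalizes heavily and micro F1 barely notices.
from sklearn.metrics import f1_score

y_true = ["person"] * 8 + ["location"] + ["organization"]
y_pred = ["person"] * 8 + ["person"] + ["organization"]

print(f1_score(y_true, y_pred, average="micro"))  # 0.9  (instance-weighted)
print(f1_score(y_true, y_pred, average="macro"))  # ~0.65 (class-averaged)
```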
5.3. Results and Analysis
Overall performance. Our model is compared with competitive baselines, including TaBERT [7], TABBIE [8], and DODUO [40]. The overall comparison with the baseline models is outlined in Table 3 and Table 4, based on the average performance of 5-fold tests. The experimental results indicate that our model achieves outstanding performance, with the two key metrics reaching 82.4%/63.6% and 72.6%/62.8% on the SemTab2019 dataset, and 86.3%/73.2% and 61.5%/51.9% on the HardTables2022 dataset, significantly outperforming the baseline models. This highlights the model’s robust feature extraction capabilities and its high adaptability to the task requirements. Comparing the improvements in the two F1 scores, our model shows a valuable enhancement in macro F1, indicating that our multi-task learning framework strengthens the learning ability for small-sample categories. Furthermore, our model’s performance relative to DODUO suggests that its architecture is better suited to table annotation tasks, aligning inter-task correlations more appropriately and enabling genuine information interaction between tasks. Unlike many tabular understanding models, such as TaBERT and TABBIE, which require pre-training on extensive data, our model excels through a refined model design and a multi-task learning strategy, achieving exceptional results while training solely on the target datasets. The collaboration among the three tasks contributes to a substantial improvement in overall performance.
We illustrate the training losses on SemTab2019 during the training process in Figure 3. Notably, the training losses for the primary tasks of CTA and CRA exhibit closely aligned descending trends, whereas the training loss for the NER task decreases slightly more slowly. This phenomenon can be attributed to our prescribed learning rate. The observation underscores the importance of flexibly adjusting training strategies during training to accommodate varying task complexities and align with the model’s requirements. Our training strategy, characterized by its simplicity and efficacy, ensures balanced progress across the diverse tasks. It allows for appropriate adjustments based on each task’s characteristics, thereby optimizing the training process, promoting overall model performance, and proficiently meeting the demands of complex real-world applications.
Ablation analysis. To analyze the distinct roles of the different subtasks within our model, we conduct ablation experiments and present the results in Table 3 and Table 4. The “target only” setting indicates training our model solely on a single target task, while the remaining three ablation experiments involve training with the corresponding subtask excluded. On the SemTab2019 dataset, training solely on a single target task results in a performance decrease of 4.3%/10.4% and 2.9%/6.0% for the CTA and CRA tasks, respectively, compared to full training. The introduction of the NER task promotes both the CTA and CRA tasks, proving the efficacy of incorporating entity information. Collaboration between the CTA and CRA tasks is evident in their mutual performance enhancement, particularly in the increased macro-average F1 scores. On the HardTables2022 dataset, where most tables contain numerical data with fewer entities, the influence of NER is considerably reduced. Despite this, CTA continues to significantly benefit CRA. Additionally, the notably larger gains in macro F1 compared to micro F1 highlight the model’s proficiency in boosting classification accuracy for categories with fewer instances. These experimental results underscore the robust inter-task correlations successfully captured by our model design, which greatly reinforces overall performance.
The number of input rows. To investigate the impact of the maximum number of input rows for tables, we conduct experiments on SemTab2019, which contains many large tables. As detailed in Table 5, “n row” denotes inputting only the first n rows of a table, while “max length” refers to truncating the table sequence according to the model’s maximum input length. The experimental results indicate that our model maintains satisfactory performance even when utilizing a reduced number of input rows, highlighting its capability to effectively comprehend and utilize input data. This adaptability underscores its suitability for scenarios involving small-scale tables.
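The two truncation settings can be sketched as follows. This is a hypothetical reconstruction: in particular, whether the “max length” setting keeps whole rows or cuts mid-row is our assumption, and `tokenize` stands in for the model’s tokenizer, which the paper does not detail here.

```python
# Illustrative sketch of the two input-truncation settings, assuming a table
# is represented as a list of rows, each row a list of cell strings.
from typing import Callable

def truncate_rows(rows: list[list[str]], n: int) -> list[list[str]]:
    """The "n row" setting: keep only the first n rows of the table."""
    return rows[:n]

def truncate_to_max_length(
    rows: list[list[str]],
    tokenize: Callable[[str], list[str]],
    max_len: int = 512,
) -> list[list[str]]:
    """The "max length" setting: keep whole rows until the serialized table
    would exceed the model's maximum input length (whole-row policy assumed)."""
    kept, used = [], 0
    for row in rows:
        cost = sum(len(tokenize(cell)) for cell in row)
        if used + cost > max_len:
            break
        kept.append(row)
        used += cost
    return kept
```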
The size of training data. Figure 4 illustrates our model’s training outcomes across varying data volumes of SemTab2019, aimed at evaluating its learning efficacy with reduced training data. The training data for each class are uniformly and randomly partitioned into subsets of different sizes for training, while the test set remains unchanged. Significantly, a notable decline in performance is observed when the training data volume drops below 60%, primarily due to the skewed distribution of samples among categories within the dataset. However, training with 80% of the dataset results in only a marginal decrease in the micro F1 scores for the CTA and CRA tasks, by 2.1% and 3.8%, respectively. These findings underscore the model’s robustness in achieving relatively acceptable performance under conditions with limited yet balanced samples.
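A plausible way to construct these reduced training sets is a per-class (stratified) uniform random subsample, matching the description above; the use of scikit-learn and this particular helper are our illustrative choices, not details given in the paper.

```python
# Stratified random subsampling: keep a given fraction of the training data
# while preserving (approximately) each class's share of examples.
from sklearn.model_selection import train_test_split

def subsample(samples: list, labels: list, fraction: float, seed: int = 42):
    """Return `fraction` of the training data, sampled per class."""
    kept, _, kept_labels, _ = train_test_split(
        samples, labels,
        train_size=fraction, stratify=labels, random_state=seed,
    )
    return kept, kept_labels
```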
Compared with weighted loss training. During the training process of multi-task learning, we employ a cyclic strategy, sequentially training each task in each epoch. For comparison, we try another strategy: training with weighted losses over the multiple subtasks. In this scenario, training samples need to be labeled for both CTA and CRA simultaneously. However, not every sample in the datasets meets this condition; some tables are labeled only with column types, while others are labeled only with inter-column relationships. In HardTables2022, for example, approximately 12% of the data are non-overlapping. For weighted loss training, we experiment with various weight combinations, among which the combination (0.1, 0.4, 0.5) yields the best results, with a learning rate set at 2 × 10⁻⁵. The resulting performance on HardTables2022 is 84.6%/68.4% and 57.2%/45.2%, lower than that achieved with our cyclic training strategy. This indicates that adopting an alternating training method for subtasks offers certain advantages in cases of task-sample imbalance.
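A minimal sketch of the cyclic strategy, for contrast with joint weighted-loss training; the separate per-task optimizers, the task ordering, and the `task=` argument on the forward pass are assumptions rather than details given in the paper.

```python
# Cyclic multi-task training: within each epoch, visit the three subtasks in
# turn, each with its own data loader, loss function, and optimizer step.
# `model`, `optimizers`, `loaders`, and `loss_fns` are assumed to be
# task-keyed dicts over a shared-encoder model; PyTorch is illustrative.
import torch

def train_cyclic(model, optimizers, loaders, loss_fns, epochs: int) -> None:
    tasks = ["cta", "cra", "ner"]  # ordering assumed
    model.train()
    for _ in range(epochs):
        for task in tasks:  # one task at a time, never a mixed loss
            for batch, targets in loaders[task]:
                optimizers[task].zero_grad()
                logits = model(batch, task=task)  # shared encoder, task head
                loss = loss_fns[task](logits, targets)
                loss.backward()  # gradient from this task's loss only
                optimizers[task].step()
```

Under this scheme, each batch contributes a gradient for exactly one subtask, so tables labeled for only CTA or only CRA remain usable; the weighted-loss variant, by contrast, requires samples jointly labeled for both tasks.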
Case study. To show the effect of our model more intuitively, we present a case study in Figure 5, comparing the results on two samples from the HardTables2022 dataset. In the first case, it can be observed that TaBERT incorrectly predicts multiple column types and column relationships, as the types of purely numeric columns are difficult to distinguish. By utilizing NER information and inter-column relationships, our model can accurately predict the relationship types of the numerical columns. In the second, a failure case, all models are wrong in their predictions of column type (e). Specifically, TaBERT predicts “barangay”, DODUO predicts “capital city”, and our model predicts “brick and mortar”. Since “brick and mortar” and “subsidiary” are semantically similar, our model’s prediction is closest to the correct answer. These cases demonstrate that when predicting a single column type or inter-column relationship is challenging, the interaction between column types and inter-column relationships becomes crucial. These effective annotations underscore the quality and utility of our model, validating its superiority in handling difficult annotation tasks.