1. Introduction
Tables are widely used tools for data visualization and analysis, organizing information systematically into rows and columns. They enable the structured representation of diverse data types, including numerical, textual, and Boolean values, conferring significant advantages for data storage, retrieval, and analysis. The inherent structural clarity and high information density of tables make them applicable across various domains, spanning from business and scientific research to everyday tasks. By providing organized data, tables enhance work efficiency and support higher quality decision-making.
Tables frequently co-occur with textual data, with both formats complementing each other within a given context. Many tasks require joint reasoning over data presented in these two formats. However, tables diverge from sequential natural language (NL) in that the information they convey is more succinct and features a two-dimensional structure. In contrast to the conventional challenges in natural language processing (NLP), research on tables introduces additional complexities: (1) The structural information within a table is critical, yet an optimal method for encoding this structure is presently lacking. (2) Table understanding necessitates not only surface-level semantic parsing but also the execution of diverse logical operations such as argmax, count, and average. (3) A semantic gap exists between NL and tables, which is particularly evident in the utilization of pre-trained models. The comprehension and application of tables particularly underscore the logical operations derived from tabular data. Logical operations refer to a set of operations used for reasoning over and manipulating table contents, which generally involve the analysis and comparison of rows or columns across various attributes. Typical examples include superlative, which identifies the maximum or minimum value within a table; counting, which calculates the number of entries that satisfy a given condition; and comparison, which relates different entries to one another. While the information presented in tables may be straightforward, the logical relationships and numerical connections between table records are not apparent, requiring advanced reasoning capabilities.
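To make the notion of logical operations concrete, the following minimal sketch implements superlative, counting, comparison, and average over a small table. The rows and the helper names (`superlative`, `count`, `compare`, `average`) are illustrative assumptions and are not part of the proposed framework.

```python
# Toy table used only for illustration (hypothetical rows, not a dataset sample).
table = [
    {"year": 1896, "city": "Athens", "country": "Greece", "nations": 14},
    {"year": 1900, "city": "Paris", "country": "France", "nations": 24},
    {"year": 1904, "city": "St. Louis", "country": "United States", "nations": 12},
]


def superlative(rows, column, largest=True):
    """Return the row holding the maximum (or minimum) value in `column`."""
    pick = max if largest else min
    return pick(rows, key=lambda r: r[column])


def count(rows, predicate):
    """Count the rows that satisfy `predicate`."""
    return sum(1 for r in rows if predicate(r))


def compare(rows, key_col, a, b, value_col):
    """True if the row keyed by `a` has a larger `value_col` than the row keyed by `b`."""
    lookup = {r[key_col]: r[value_col] for r in rows}
    return lookup[a] > lookup[b]


def average(rows, column):
    """Arithmetic mean of a numeric column."""
    return sum(r[column] for r in rows) / len(rows)
```

Even on three rows, answering "which city hosted the most nations" requires an aggregation over a column rather than a lookup of a single cell, which is the kind of reasoning the rest of the paper targets.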
Pre-trained language models like BERT [1] and GPT [2] have exhibited notable proficiency in dealing with various free-form NL tasks. Motivated by this substantial achievement, researchers seek to extend pre-training to structured tabular data, aiming to create generalizable models that can jointly learn enhanced representations of both tabular and textual data. Most table pre-trained models focus on designing pretext tasks, encoding structural information, and synthesizing training data. However, the computational cost of table pre-training is generally much higher than that of text pre-training. Despite these efforts, table pre-trained models fail to achieve the same level of generalization as text pre-trained models and often find application limited to a specific task. For example, table pre-trained models such as TaBERT [3] and TAPAS [4] are only suitable for table-based question-answering tasks. Given the relatively modest performance gains achieved, their high resource consumption is not cost-effective.
We contend that existing text pre-trained models are sufficient for capturing the semantic information embedded within tables, so our focus is directed towards addressing logical reasoning based on tables. In human cognition, logical reasoning is typically performed by first categorizing the types of operations and then applying the corresponding functions. Moreover, intuitively, it should be much easier for a model to execute a single type of logical operation rather than managing various types simultaneously. Hence, we subdivide table operations based on logical types, employing distinct modules to emulate the execution of different logical operations, each independently yielding corresponding outcomes. In line with this conceptualization, we devise a universal framework named TabMoE, based on the mixture-of-experts (MoE) architecture, which is applicable across diverse types of table-related tasks. Specifically, the TabMoE encompasses multiple experts, each functioning as an agent for some specific types of logical reasoning, along with a gate responsible for selecting the appropriate expert. It is noteworthy that the logical types of samples in datasets are not annotated, so we cultivate the classification capability of the gate module in an unsupervised manner, by using a hard Expectation–Maximization (EM) algorithm.
We conduct experiments on three distinct tasks that involve higher-order tabular logical reasoning. Figure 1 illustrates examples of these tasks: table-based question answering, table-based fact verification, and table-to-text generation. The first task is to extract or infer answers from a table according to queries, the second is to check facts in sentences against a table, and the third is to generate sentences containing logical reasoning derived from the table content.
Our contributions are threefold:
We propose a general framework based on the mixture-of-experts (MoE) concept to address logical reasoning challenges in table understanding. This framework exhibits applicability across a variety of table-related tasks, including both classification and generation tasks.
Our framework operates independently of table pre-trained models and demonstrates competitive performance even when trained on limited datasets.
To substantiate our framework’s superiority, we conduct a series of empirical experiments, assessing its effectiveness, affirming its practical applicability, and delineating the additional advantages conferred by this framework.
2. Related Work
2.1. Tabular Data Representation
Following the tremendous success of pre-trained free-text representation learning in NLP, there has been increasing interest in recent years in exploring representation learning for tabular data. The basic paradigm of pre-training involves capturing general semantic knowledge from large-scale textual data through self-supervised learning, followed by fine-tuning for various downstream tasks. Table pre-training largely adheres to this approach: it begins by gathering extensive datasets of tables along with relevant contextual information (such as headers, auxiliary data, and textual content), followed by employing the appropriate pre-training tasks aimed at learning general tabular representations to improve performance on downstream applications. Research efforts made in table pre-training primarily focus on improvements in three aspects: encoding methods, model structures, and pre-training tasks. Enhancements at the input layer typically involve the incorporation of additional positional embeddings to enhance the encoding of structural information within tables, including embeddings for rows [4], columns [5], and hierarchical structures [6]. Improvements in model structures often concentrate on optimizing attention mechanisms within transformers, including vertical attention [3], tree attention [6], and sparse attention [7]. Advancements in pre-training tasks exhibit diversified directions. Beyond drawing inspiration from tasks in text pre-training, such as masking and corrupting, there are also classification and ranking objectives tailored to specific tasks.
Owing to the complexity of table semantic parsing, the data demands for table pre-training considerably surpass those for text pre-training. Despite substantial investments into data collection and computational resources, table pre-training has consistently lagged behind text pre-training in terms of representation effectiveness and generalization capabilities. This issue of resource inefficiency is a critical concern. Additionally, current table pre-training methods typically demonstrate effectiveness only in one or two specific downstream tasks. The high resource demand and limited application scalability have sparked ongoing debates about whether table pre-training is truly the optimal way to advance table understanding. To date, no model structure or training method has emerged as a clear leader or achieved a notable breakthrough in the field of table pre-training. Balancing model performance with reduced resource consumption, while also enhancing generalization and scalability, remains a critical challenge for further exploration in table representation learning.
2.2. MoE
The concept of MoE (mixture-of-experts) is the parallel construction of multiple modular branches as experts within a single system. The core components of MoE include three parts: experts, gating, and routing. Usually, each expert functions as an independent sub-model that specializes in processing a distinct subset of input data. The gating mechanism generates weights to determine which experts need to be activated, while the routing mechanism refers to the strategy of assigning the experts and plays a crucial role in shaping the sparsity of MoE. MoE models manifest broad applicability across diverse domains, including computer vision [8], natural language processing [9], and speech recognition [10]. In the field of table understanding, SaMoE [11] employs the MoE framework with a self-adaptive method for table-based fact verification. However, SaMoE includes a necessary step where it calculates the weighted sum of all the experts’ outputs, which makes it suitable only for classification tasks that compute probabilities, like table-based fact verification. This design limits its applicability to generative tasks, like table question-answering and table-to-text generation. In contrast, our method ensures that each expert can independently perform the task, enabling its use in both classification and generative tasks.
Depending on how weights are allocated to the experts, MoE models can be categorized into soft weighting and hard gating. Traditional soft MoE methods dynamically adjust the combination of different experts using real-valued weights. However, this necessitates the execution of all the experts, which prevents a reduction in computational overhead during inference. In contrast, hard gating produces an n-dimensional vector with only a few non-zero elements, controlling the execution of the experts. Only a minority of the experts are activated during a single use, markedly reducing computational consumption and augmenting the efficiency of MoE. Currently, hard gating is the prevailing method in MoE and is seamlessly integrated into various network architectures. For instance, HydraNet [12] substitutes the convolutional blocks in a CNN (convolutional neural network) with multiple experts, the sparsely gated MoE [13] incorporates MoE into LSTM (long short-term memory), and the Switch Transformer [14] integrates MoE into the transformer architecture. Recent research [15,16] has shifted from conceptualizing the experts as layers within a network to regarding them as independent and specialized modules, showcasing advancements in efficiency and composability. Further investigations [14,17] have delved into the scaling properties of the experts, including their size, quantity, frequency, and behaviors during pre-training and fine-tuning in the construction of MoE models. Moreover, there has been extensive exploration into routing techniques that determine how data interact with the experts, such as deterministic hash routing [18], routing via reinforcement learning [19], and the Sinkhorn redistribution algorithm [20].
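The difference between soft weighting and hard gating can be illustrated with a small sketch. The three toy experts, the gate scores, and the function names below are hypothetical and chosen only for illustration; the second value returned by each function is the number of experts that had to be executed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_moe(x, experts, gate_scores):
    """Soft weighting: every expert runs and the outputs are mixed by real-valued weights."""
    weights = softmax(gate_scores)
    outputs = [f(x) for f in experts]
    return sum(w * o for w, o in zip(weights, outputs)), len(experts)


def hard_moe(x, experts, gate_scores, top_k=1):
    """Hard gating: only the top_k experts run; all other gate weights are treated as zero."""
    weights = softmax(gate_scores)
    chosen = sorted(range(len(experts)), key=lambda i: weights[i], reverse=True)[:top_k]
    norm = sum(weights[i] for i in chosen)
    out = sum(weights[i] / norm * experts[i](x) for i in chosen)
    return out, len(chosen)


# Hypothetical scalar "experts" and gate scores.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
gate_scores = [0.1, 2.0, -1.0]
```

With `top_k=1`, the output is exactly that of a single expert, so each expert must be able to solve the task on its own, which is the property that generation tasks require.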
3. Problem Definition
We verify the effectiveness of our proposed framework on three typical table-related tasks: table-based fact verification, table-based question answering, and table-to-text generation. The input for the first two tasks consists of a table and text, while the input for the last is only a table. We systematically present the primary symbols utilized throughout this paper in Table 1, facilitating a clear and organized representation. In all tasks, table $T$ consists of $R$ rows and $C$ columns with content $T_{i,j}$ in the $(i,j)$-th cell. The value $T_{i,j}$ can be a number, a word, or a phrase. Table $T$ may also include a title, denoted as an NL sequence $w$. In the following sections, we also represent all inputs by $x$ and all outputs by $y$, in order to describe the process of different tasks in a generalized manner. We formally define these related tasks as below:
Problem 1 (table-based question answering): Given a table $T$ and an NL question $q$, the goal is to accurately extract relevant information from table $T$ and generate an appropriate answer $a$ for the question. The process can be represented as $f(T, q) \rightarrow a$.
Problem 2 (table-based fact verification): Given a table $T$ and an NL statement $s$, which describes a fact to be verified against the content of the table, the goal is to predict the verification label of the statement. If the statement $s$ is entailed by table $T$, then $y = 1$, which means statement $s$ is consistent with $T$; otherwise it is refuted by $T$ and the label is $y = 0$. It is a binary classification process $f(T, s) \rightarrow \{0, 1\}$.
Problem 3 (table-to-text generation): Given a table $T$, the goal is to generate several sentences $\{y_1, \ldots, y_n\}$, which are both fluent and logically (numerically) supported by table $T$. The generation process is denoted as $f(T) \rightarrow \{y_1, \ldots, y_n\}$.
4. Methodology
In this section, we present the details of our framework, the table mixture-of-experts network (TabMoE), designed for tabular logical reasoning. The TabMoE first processes the table and associated text, learning their embeddings by aggregating contextual information. Multiple distinct experts, each specializing in different logical types, are then utilized for logical operations on table records. The overall architecture of the TabMoE is illustrated in Figure 2. For better general expression, we summarize all tasks as

$$p(y \mid x; \theta), \tag{1}$$

where $x$ represents all the inputs in one task, such as the input table–question pair $(T, q)$ in the question answering task, $y$ represents the outputs, such as the binary label in fact verification, and $\theta$ denotes all the trainable parameters in the model.
We provide an in-depth discussion of our methodology in subsequent subsections:
Section 4.1 delineates the pre-processing and representation of tables and text.
Section 4.2 introduces the expert modules responsible for different logical types.
Section 4.3 explains the training strategy of the TabMoE and how we maintain the distinctiveness of each expert.
4.1. Table Pre-Processing and Representation
Before learning the representation of input data, we perform a pre-processing procedure on tables. This pre-processing includes table pruning and serializing, aimed at meeting the input length limit and input format of pre-trained models. A table stores a wealth of information, but not all of it is needed for a single operation. Table pruning removes some parts of the table, leaving the more informative parts for subsequent operations. We refine the pruning algorithm introduced by Chen et al. [21]. The original method links entities in a sentence with cells in the table in a heuristic way, then selects the columns containing matched cells. We additionally keep adding the columns that were not chosen to the pruned table until the maximum input length is reached, which reduces the risk of dropping a critical column.
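The pruning procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the paper: entity linking is replaced by case-insensitive substring matching, the input length is approximated by a whitespace token count, and the function name `prune_columns` is hypothetical.

```python
def prune_columns(header, rows, sentence, max_tokens):
    """Keep columns linked to the sentence first, then back-fill unlinked columns
    until the (approximate) token budget is reached."""
    text = sentence.lower()

    def cost(i):
        # Whitespace token count of the header plus all cells in column i
        # (a crude stand-in for a real tokenizer).
        return len(header[i].split()) + sum(len(str(r[i]).split()) for r in rows)

    def is_linked(i):
        # A column is linked if its header or any of its cells occurs in the sentence.
        return header[i].lower() in text or any(str(r[i]).lower() in text for r in rows)

    linked = [i for i in range(len(header)) if is_linked(i)]
    rest = [i for i in range(len(header)) if i not in linked]

    kept, used = [], 0
    for i in linked + rest:
        if used + cost(i) > max_tokens:
            break  # stop once the next column would exceed the budget
        kept.append(i)
        used += cost(i)

    kept.sort()  # restore the original column order
    return [header[i] for i in kept], [[r[i] for i in kept] for r in rows]


header = ["year", "city", "country", "nations"]
rows = [
    [1896, "Athens", "Greece", 14],
    [1900, "Paris", "France", 24],
    [1904, "St. Louis", "United States", 12],
]
```

For the sentence "Athens hosted the games in 1896", the columns `year` and `city` are linked; with a larger budget, `country` is back-filled as well.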
Owing to the remarkable proficiency exhibited by pre-trained transformers in the semantic understanding of textual data, we employ them to encode tabular data, capturing their inherent semantic features. However, a table with a two-dimensional structure cannot be directly processed by pre-trained models, which require a one-dimensional input sequence. Prior research typically employs templates or special delimiters to linearize the table into an extended sequence through horizontal or vertical scanning. In our approach, we concatenate the values of each cell with their respective header for each row in the table. For instance, considering the table in Figure 1, the first row will be represented as follows: “in row 1, year: 1896 | city: Athens | country: Greece | nations: 14”. Then we have the serialized table $\tilde{T}$ by concatenating all serialized rows horizontally. For table-based question answering and fact verification, the final input sequence $x$ is formed by concatenating the text (the question $q$ or the statement $s$) with $\tilde{T}$. For table-to-text generation, the sequence $x$ is formed from the table alone. In conformity with customary practices in pre-trained models, special tokens are inserted into the sequence $x$ to meet the input format. Despite the existence of parameter competition among different types of logical reasoning, the input data still share fundamental semantic features. Therefore, a shared bottom encoder, composed of several transformer layers, is deployed to capture low-level contextual semantics:

$$H = \mathrm{Enc}(x).$$

For each input sequence $x$, the bottom encoder $\mathrm{Enc}$ encodes it into contextualized embeddings $H \in \mathbb{R}^{l \times d}$, where $l$ is the maximum length of the input sequence and $d$ is the dimension of the hidden vector. Subsequently, the latent representations go through the following specialized experts to extract high-level logical semantics for reasoning.
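The row-wise serialization described above can be sketched in a few lines. The row template follows the example given in the text; the separator between consecutive rows (a single space here) and the function names are assumptions.

```python
def serialize_row(index, header, row):
    """Render one row as 'in row i, header: value | header: value | ...'."""
    cells = " | ".join(f"{h}: {v}" for h, v in zip(header, row))
    return f"in row {index}, {cells}"


def serialize_table(header, rows):
    """Concatenate all serialized rows into one sequence (row separator is an assumption)."""
    return " ".join(serialize_row(i, header, r) for i, r in enumerate(rows, start=1))


header = ["year", "city", "country", "nations"]
rows = [
    [1896, "Athens", "Greece", 14],
    [1900, "Paris", "France", 24],
]
serialized = serialize_table(header, rows)
```

Repeating the header next to every cell lengthens the sequence, but it keeps each value adjacent to its column name, so a text-only encoder does not have to recover column alignment from position alone.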
4.2. Experts
In the TabMoE, each expert is in charge of a distinct subset of data based on its unique and specialized functionality. As illustrated in Figure 2, we build a set of modules with $K$ experts. These independent experts systematically decompose an overall task into discrete subtasks by logical types and simulate the corresponding reasoning processes. It is noteworthy that all the experts operating within this framework share the same input tensor and network structure. The structure of the experts can be flexibly adjusted to align with the specific requirements of a given task. The feature embeddings $H$ obtained in Section 4.1 are fed unchanged into the activated experts. Here, for classification tasks like fact verification, each expert consists of transformer layers $\mathrm{Trans}_i$ and perceptron layers $\mathrm{MLP}_i$:

$$h_i = \mathrm{Trans}_i(H)_{[0]}, \qquad \hat{y}_i = \sigma\left(\mathrm{MLP}_i(h_i)\right),$$

where $i$ is the index of the expert, $h_i$ is the hidden vector of the first token in the sequence, $\mathrm{MLP}_i$ is a two-layer perceptron, $\sigma$ is the softmax function, and $\hat{y}_i$ is the classification prediction of the $i$-th expert. As for generation tasks such as question answering and table-to-text generation, the expert $\mathrm{Trans}_i$ functions as a decoder, which is also composed of transformer layers:

$$\hat{y}_i = \mathrm{Trans}_i(H),$$

where $\hat{y}_i$ is the text generated by the $i$-th expert.
The experts themselves are neutral and undirected, and certain tasks, such as question answering, usually require only one expert. Therefore, we add a gating module to route data tensors towards a particular expert based on data features. Here, we employ several transformer layers $\mathrm{Trans}_g$ and perceptron layers $\mathrm{MLP}_g$ as a gating module:

$$h_g = \mathrm{Trans}_g(H)_{[0]}, \qquad p_g = \sigma\left(\mathrm{MLP}_g(h_g)\right) \in \mathbb{R}^{K},$$

where $h_g$ is the hidden vector of the first token in the sequence, $\mathrm{MLP}_g$ is a two-layer perceptron, $p_g$ is the probability of logical types predicted by the gating module, and $K$ is the number of experts. Depending on the score vector $p_g$, the $t$-th expert is selected as the reasoning expert for the target logical type. However, there are no annotated labels for logical types within these datasets, so we use the unsupervised EM algorithm for the training process, as elucidated in Section 4.3.
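The routing step can be sketched as follows. This is a deliberately reduced illustration: a single linear layer stands in for the gate's transformer and perceptron layers, the experts are stub callables, and picking the highest-scoring expert is an assumption about how the score vector is used. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_probabilities(h_first, weight, bias):
    """Score K logical types from the first-token vector (linear stand-in for the gate)."""
    scores = [sum(w * h for w, h in zip(row, h_first)) + b for row, b in zip(weight, bias)]
    return softmax(scores)


def route(h_first, weight, bias, experts):
    """Select the highest-scoring expert and run only that expert."""
    probs = gate_probabilities(h_first, weight, bias)
    t = max(range(len(probs)), key=probs.__getitem__)
    return t, experts[t](h_first)


# Toy setup with K = 3 stub experts and a 2-dimensional first-token vector.
weight = [[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
bias = [0.0, 0.0, 0.0]
experts = [lambda h: "expert-0 output", lambda h: "expert-1 output", lambda h: "expert-2 output"]
selected, output = route([1.0, 0.0], weight, bias, experts)
```

Because only the selected expert is executed, the cost of a forward pass does not grow with the number of experts, apart from the gate itself.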
4.3. Training
The training of a multi-expert model is different from traditional model training paradigms. In our framework, we need multi-expert training to guide each expert in learning a particular type of logical reasoning while maintaining balanced training across all experts. This requires that each expert’s parameters be trained on the pertinent types of data, so that all the experts within the model are fully trained. However, explicit logical type annotations for data are not available in the tasks addressed by our framework. Therefore, we employ unsupervised training by introducing a latent variable $z$ as an expert’s index. For the simplified model in Equation (1), which includes the source input $x$, target label $y$, and parameters $\theta$, the marginal likelihood can be decomposed as follows:

$$p(y \mid x; \theta) = \sum_{z=1}^{K} p(z \mid x; \theta)\, p(y \mid x, z; \theta),$$

where $p(z \mid x; \theta)$ is the prior probability, and $p(y \mid x, z; \theta)$ is the likelihood probability. Each expert takes responsibility for a sample observation $(x, y)$ through the posterior probability:

$$p(z \mid x, y; \theta) = \frac{p(z \mid x; \theta)\, p(y \mid x, z; \theta)}{\sum_{z'=1}^{K} p(z' \mid x; \theta)\, p(y \mid x, z'; \theta)}.$$

For classification tasks, the loss function is cross-entropy loss. For generation tasks, the loss utilized for auto-regression is also cross-entropy loss; both are denoted as follows:

$$\mathcal{L}(\theta) = -\log p(y \mid x; \theta).$$

After introducing the latent variable $z$, maximizing the log likelihood can be decomposed as follows:

$$\log p(y \mid x; \theta) = \log \sum_{z=1}^{K} p(z \mid x; \theta)\, p(y \mid x, z; \theta).$$
Typically, the EM algorithm is suitable for training models in this context. However, our model exhibits some unique characteristics: each expert within the model is assigned operations that belong to a distinct logical category, which requires that each sample be allocated to only one expert during the training process. Given this setup, we utilize the hard EM algorithm. While the standard EM algorithm iteratively approximates the maximization of the objective function, the hard EM algorithm selects only the most probable expert for updating in each iteration. This strategy enhances the framework’s learning efficiency, meeting the training needs of our gate module. Moreover, it reduces the computational complexity, significantly lowers the GPU memory consumption, and accelerates convergence. The whole training process iterates the following two steps until convergence is reached:
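- E-step: given the current parameters $\theta$, estimate the probabilities of each expert and identify the target expert by $t = \arg\max_{z} p(z \mid x, y; \theta)$, i.e., the expert with the highest posterior probability.
- M-step: based on the expert selected in the E-step, update the parameters by maximizing the likelihood.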
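The two steps can be illustrated on a toy problem. The sketch below is not the TabMoE training loop: each "expert" is a one-parameter linear model $y = a_z x$ with a unit-variance Gaussian likelihood, and the gate is reduced to an input-independent prior over experts. These simplifications, and the name `hard_em`, are assumptions made to keep the example self-contained.

```python
import math


def hard_em(data, slopes, priors, n_iters=10):
    """Hard EM for a mixture of linear experts y = a_z * x (toy illustration)."""
    slopes, priors = list(slopes), list(priors)
    k = len(slopes)
    for _ in range(n_iters):
        # E-step: assign each sample to the expert with the highest posterior,
        # computed in log space as log prior + log likelihood (up to a constant).
        assign = []
        for x, y in data:
            log_post = [
                math.log(max(priors[z], 1e-12)) - 0.5 * (y - slopes[z] * x) ** 2
                for z in range(k)
            ]
            assign.append(max(range(k), key=log_post.__getitem__))
        # M-step: update each expert only on the samples assigned to it.
        for z in range(k):
            pts = [(x, y) for (x, y), t in zip(data, assign) if t == z]
            if pts:  # least-squares slope on the assigned samples
                slopes[z] = sum(x * y for x, y in pts) / sum(x * x for x, _ in pts)
            priors[z] = len(pts) / len(data)
    return slopes, priors, assign


# Two hidden "logical types": y = 2x and y = -x.
data = [(x, 2 * x) for x in range(1, 6)] + [(x, -x) for x in range(1, 6)]
slopes, priors, assign = hard_em(data, slopes=[1.0, -0.5], priors=[0.5, 0.5])
```

Each sample updates exactly one expert per iteration, which is what keeps the experts specialized; the price is sensitivity to initialization, since an expert that receives no samples early on is never updated.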
5. Experiment Setup
In this section, we present the datasets, evaluation metrics, and implementations of experiments on three distinct types of tasks associated with logical reasoning.
5.1. Datasets and Evaluations
We select three widely used representative datasets of these tasks, including weakly supervised WikiSQL [22], TabFact [21], and LogicNLG [23], corresponding to table-based question answering, fact verification, and table-to-text generation, respectively. Table 2 shows the basic statistics of these datasets. Note that in all the datasets, the training, validation, and test sets have no overlap in the tables. The evaluation metrics for each dataset are outlined below:
WikiSQL: It is the largest crowd-sourced dataset focused on logical forms, designed to develop natural language interfaces for relational databases. Its designated evaluation metric is denotation accuracy, which checks the congruence between the predicted answer and the reference answer. We apply the evaluation implementation provided by TAPAS [4].
TabFact: It is a large-scale dataset consisting of 16k Wikipedia tables used as evidence for 118k human-annotated NL statements, which require both soft linguistic reasoning and hard symbolic reasoning. The evaluation metric for TabFact is accuracy, computed as the proportion of correct predictions. The test set is further split into simple and complex channels, where simple statements only refer to a single row or record, and complex ones require higher-order semantic operations like argmax and count. On a small test set containing about 2k samples, human evaluation reaches an accuracy of 92.1%.
LogicNLG: Unlike previous data-to-text datasets which purely focus on sequence generation, LogicNLG specifically emphasizes inferences involving symbolic operations over the given table. Its evaluation metrics include BLEU for assessing surface-level consistency, along with the accuracy of semantic parsing (SP-Acc) and accuracy of natural language inference (NLI-Acc) for assessing logic-level fidelity. It is worth noting that SP-Acc and NLI-Acc are based on previous models, whose accuracies are merely satisfactory, implying that these two metrics lack high precision and cannot serve as gold-standard measurements.
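As a rough illustration of denotation accuracy, the sketch below compares each predicted answer with its reference after a simple normalization. It is not the TAPAS evaluation script: the normalization rule (lower-casing, stripping, and parsing numbers) and the function names are assumptions.

```python
def normalize(value):
    """Map an answer value to a comparable form: a float when numeric, else lower-cased text."""
    text = str(value).strip().lower()
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return text


def denotation_accuracy(predictions, references):
    """Fraction of examples whose predicted answer list matches the reference, order-insensitively."""
    hits = sum(
        1
        for pred, ref in zip(predictions, references)
        if sorted(map(normalize, pred), key=str) == sorted(map(normalize, ref), key=str)
    )
    return hits / len(references)


predictions = [["Athens"], ["14"], ["Paris", "Athens"]]
references = [["athens"], [14.0], ["Athens", "London"]]
accuracy = denotation_accuracy(predictions, references)
```

Plain accuracy on TabFact is the same computation with single binary labels in place of answer lists.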
5.2. Experimental Settings
All the experts in the TabMoE share an identical structure and initialization. We initialize the encoders and experts in the TabMoE with pre-trained parameters from BART [24], and the remaining parameters are initialized randomly. BART is a language model pre-trained on free-form text with an encoder–decoder architecture, in which the encoder and the decoder have the same number of transformer layers. Specifically, the TabMoE’s bottom encoder is initialized with BART’s encoder, and all the experts are initialized with BART’s decoder. In order to ensure fair comparisons with baseline models of comparable parameter counts, we employ BART-large (24 layers) for question answering and fact verification, and BART-base (12 layers) for table-to-text generation.
We set
types of experts for fact verification and table-to-text generation, and
for question answering. Our implementation is based on PyTorch and the Transformers library [25,26]. The transformer parameter settings align with the default configurations of BART. All experiments are conducted on two Titan RTX GPUs, using a batch size of 16 for the large-size model and 48 for the base-size model. We employ the AdamW optimizer with a learning rate ranging from $10^{-5}$ to $5 \times 10^{-5}$, evaluating every 2000 steps and saving the checkpoint with the best performance on the validation set. All models are trained for at most 15 epochs.
6. Results
In this section, we evaluate the experimental results and engage in an in-depth discussion concerning the TabMoE’s performance across three table-related tasks to verify its effectiveness and potential advantages.
6.1. Overall Performance
Comparisons of the TabMoE with baselines on WikiSQL, TabFact, and LogicNLG are summarized in Table 3, Table 4, and Table 5, respectively. We present the performance of the TabMoE based on five random runs.
Table-based question answering. As shown in Table 3, the TabMoE exhibits notably superior performance when contrasted with all baseline models on the widely used WikiSQL dataset. Its efficacy is evident through the attainment of an 89.6% accuracy on the validation set and an 89.2% accuracy on the test set. This denotes a 3.0% improvement over BART and a notable 2.8% enhancement over the large table pre-trained model, TAPAS, specifically on the test set. These results are particularly significant given that TAPAS is a pre-trained model explicitly designed for table-based question answering, with extensive pre-training on a large corpus of 6.2 M tables, providing it with substantial prior knowledge of tables. Despite this advantageous background, the TabMoE surpasses TAPAS by solely leveraging tabular data within the dataset, highlighting its ability to outperform a specialized table pre-trained model while relying on a more limited information source.
Table-based fact verification. Table 4 presents the performance of diverse models on distinct subsets of TabFact. The TabMoE achieves accuracies of 84.7% and 84.6% on the validation and test sets, respectively, which surpasses TAPAS by 3.4% and Decomp by 1.9%. The poor performance of the large-scale table pre-trained model TAPAS indicates the limitations of table pre-training, often characterized by proficiency in pre-training tasks but lacking generalizability to other tasks. Notably, the improvement in the TabMoE manifests prominently in the complex channel, showing its heightened capacity for complex reasoning on tables compared to the previous models. This augmentation signifies the TabMoE’s proficiency in acquiring logic-level semantics and performing more complex inference.
Table-to-text generation. The results of the TabMoE and baselines on LogicNLG are presented in Table 5, encompassing assessments of both surface-level and logic-level fidelity. The TabMoE improves surface-level fidelity over the baselines, while maintaining a competitive performance in logic-level fidelity. As mentioned in Section 5.1, given the unreliability of SP-Acc and NLI-Acc in the table-to-text task, the most valued evaluation metrics are BLEU-1/2/3. The TabMoE attains notable scores of 54.4/33.3/19.3, surpassing R2D2 by 2.6/1.0/0.7, respectively. Considering that R2D2 is a pre-trained model specifically designed for the table-to-text task, the results underscore the TabMoE’s capacity to yield superior outcomes in generation tasks even with limited training data, outperforming specialized pre-trained models.
The aforementioned results on three tasks, which outperform various baseline models, showcase the overall strength and generalization of our framework on both classification and generation tasks. Our model eliminates the need for extensive table-specific pre-training on large amounts of related data, yet it can still outperform pre-trained models when trained directly on a target dataset. This highlights the TabMoE’s substantial resource efficiency and its superior effectiveness in solving table-based reasoning tasks. Additionally, this approach underscores the advantage of identifying logical types and assigning specialized experts to handle distinct logical operations and reasoning, which aligns more closely with the problem-solving framework for table-based reasoning. Further details and insights into these findings will be explored through a series of diverse experiments in subsequent sections.
6.2. Analysis
We offer a detailed presentation and analysis regarding the impact of model architecture and training data on performance outcomes. The experimental analysis includes the influence of the number of experts, the impact of logical type, and the differentiation of the experts. Furthermore, we conduct a case study and explore potential advantages.
Number of experts. To evaluate the influence of a gate module and the number of experts on model performance, we carry out experiments with different numbers of experts on three tasks. The corresponding results are graphically depicted in Figure 3. The different datasets show broadly similar trends. When the number of experts is $K = 1$, the model is equivalent to one without a gate module; the ablation results show that the introduction of a gate module leads to significant performance improvements across these tasks. The improvement is particularly evident in simpler tasks, while in more complex tasks, a gate module needs to be combined with suitable experts to fully realize its potential. Considering the overall trend, there is a marked upswing in model performance as the number of experts increases. However, after reaching a certain threshold, the results plateau and then either oscillate or decline. The optimal number of experts reaching peak performance in WikiSQL is smaller than in the other two datasets, which is possibly attributable to the lower demand for logical reasoning ability in WikiSQL. The observed performance decline can largely be attributed to limitations in the dataset’s quality and quantity. Finer subdivisions may cause insufficient training data for each expert, impeding adequate training for both the gate module and the experts, and thereby negatively impacting model performance. Additionally, increasing the number of experts results in a rapid increase in model parameters, so a suitable number must be identified. The selection of the optimal number of experts should be made based on the difficulty and the quality of the dataset, ensuring the gate module’s discriminative efficacy and the experts’ inferential capabilities.
Performance of each expert. We analyze each expert independently by evaluating its performance on the data assigned to it by the gate module, aiming to assess how effectively it adapts to various logical operations. For this analysis, we use TabFact because of its rich variety of logical types and its provision of simple and complex channels. Based on the probabilities from the gate module, the dataset is divided into five subsets. Although the gate module’s classification may not be entirely precise, it still yields a rough categorization of the samples, which allows the influence of logical types to be discerned. The logical types of these subsets are assumed based on the key trigger words provided in TabFact [21], which are not entirely accurate and serve only as a rough classification. The categories are “count”, “comparative”, “superlative”, “negation”, and “other”. In
Figure 4, the accuracy of each expert on the corresponding test subset is compared across three channels. The fact that the experts specializing in “comparative” and “negation” operations attain higher accuracy in the complex channel suggests their proficiency in table cell selection. The performance experiences a slight decline for the “count” and “superlative” operations, hinting at the increased difficulty in mastering reasoning associated with table aggregation. The imbalanced performance could be linked to the unequal distribution of logical types in the training data, highlighting a potential focus for future improvements.
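The per-expert evaluation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy inputs, and in particular the fixed mapping from expert index to logical type name are assumptions made for readability (in the paper, the types are only inferred from trigger words).

```python
from collections import defaultdict

# Assumed, illustrative mapping from expert index to logical type name.
LOGICAL_TYPES = ["count", "comparative", "superlative", "negation", "other"]

def partition_by_gate(gate_probs):
    """Assign each sample index to the expert with the highest gate probability."""
    subsets = defaultdict(list)
    for idx, probs in enumerate(gate_probs):
        best = max(range(len(probs)), key=probs.__getitem__)
        subsets[best].append(idx)
    return subsets

def per_expert_accuracy(gate_probs, expert_preds, labels):
    """Accuracy of each expert on the subset routed to it.

    expert_preds[k][i] is the prediction of expert k on sample i.
    Experts that receive no samples are omitted from the result.
    """
    accuracy = {}
    for k, indices in partition_by_gate(gate_probs).items():
        correct = sum(expert_preds[k][i] == labels[i] for i in indices)
        accuracy[LOGICAL_TYPES[k]] = correct / len(indices)
    return accuracy

# Toy data: 4 samples, 5 experts, binary labels (entailed / refuted).
gate_probs = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.10, 0.10, 0.60, 0.10, 0.10],
]
labels = [1, 0, 1, 1]
expert_preds = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
acc = per_expert_accuracy(gate_probs, expert_preds, labels)
print(acc)
```

Because each subset is defined by the gate's own argmax, any routing error propagates into the per-expert numbers, which is why the text treats the comparison as a rough indication only.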
Differentiation among experts. The differentiation of the experts is necessary for performing different logical operations in the TabMoE. To assess this, we evaluate the outputs of individual experts. We first test each expert’s ability in handling all logical types of data within TabFact. This experiment removes the gate module, with each expert tasked to analyze every sample in the test set. For each sample, the count of experts predicting correctly is recorded.
Figure 5 illustrates the correlation between the count of experts
k and the proportion of correct predictions within the test set. Evidently, as
k increases, the proportion of samples correctly predicted by
k experts gradually decreases. This trend suggests that certain data require specific expert reasoning for accurate results, thereby highlighting the variances among the experts. Additionally, we indirectly verify the differentiation of the experts by measuring the diversity of generated sentences on the LogicNLG dataset. The self-BLEU metric [
42] is employed for this purpose, which replaces the reference text in the conventional BLEU metric with the generated text itself. Self-BLEU is computed by calculating the BLEU score of each sentence against the rest of the sentences in a generated collection and then averaging the scores over all generated sentences. A lower self-BLEU score corresponds to a higher diversity of generated outputs. The results, as presented in
Table 6, demonstrate that the text generated by the TabMoE is significantly more diverse than that of other models (we only compare diversity among models with available source code). This observation underscores the experts’ capacity to generate more diverse text, thereby confirming the clear distinctions among them. This property also confers an additional advantage on generative tasks by enhancing the diversity of generated outputs, an aspect often overlooked by other models.
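Both differentiation checks described above can be sketched in a few lines. The code below is an illustrative approximation under stated assumptions: `agreement_histogram` interprets "correctly predicted by k experts" as exactly k experts, and `simple_bleu` is a simplified BLEU (whitespace tokens, uniform weights up to bigrams, no smoothing) rather than the exact BLEU configuration used in the paper. All function names and toy inputs are hypothetical.

```python
import math
from collections import Counter

def agreement_histogram(expert_preds, labels):
    """Proportion of samples predicted correctly by exactly k experts, k = 0..K."""
    counts = Counter(
        sum(preds[i] == label for preds in expert_preds)
        for i, label in enumerate(labels)
    )
    return {k: counts[k] / len(labels) for k in range(len(expert_preds) + 1)}

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, references, max_n=2):
    """Simplified BLEU: clipped n-gram precision, geometric mean, brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        total = sum(hyp_counts.values())
        # Largest count of each n-gram over all references (used for clipping).
        max_ref = Counter()
        for ref in references:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the hypothesis.
    closest = min((abs(len(r) - len(hypothesis)), len(r)) for r in references)[1]
    bp = 1.0 if len(hypothesis) >= closest else math.exp(1 - closest / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)

def self_bleu(sentences, max_n=2):
    """Average BLEU of each sentence against all other sentences in the set."""
    if len(sentences) < 2:
        raise ValueError("self-BLEU needs at least two sentences")
    tokenized = [s.split() for s in sentences]
    scores = [
        simple_bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)

# Toy example: three experts, four test samples.
hist = agreement_histogram([[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]], [1, 0, 1, 1])
repetitive = self_bleu(["a b c", "a b c", "a b c"])
diverse = self_bleu(["a b c", "d e f"])
print(hist, repetitive, diverse)
```

Identical outputs give the maximum self-BLEU and outputs with no shared n-grams give zero, which matches the reading that lower self-BLEU indicates higher diversity.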
Time complexity. When calculating the time complexity of our algorithm, the key variables to consider are the number of samples N, the number of logical types K (i.e., the number of experts), and the number of iterations I of the algorithm. In our model, the feature dimension is a fixed constant and can therefore be excluded from the complexity analysis. During the E-step of the hard EM algorithm, the most probable logical type for each sample needs to be identified. This requires computing the probability of each sample for every logical type and then selecting the type with the highest probability, resulting in a time complexity of $O(NK)$. In the M-step, the parameters of the expert corresponding to the identified logical type are updated. For K logical types, all samples need to be scanned, leading to a time complexity of $O(NK)$. Combining these two steps, the time complexity for each iteration is $O(NK)$. When the algorithm converges after I iterations, the overall time complexity is $O(INK)$.
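The cost structure of this analysis can be illustrated with a toy hard EM loop. This is a sketch under strong assumptions, not the TabMoE training procedure: each "expert" is replaced by a single scalar mean (which makes hard EM coincide with one-dimensional k-means), and the `ops` counter is a hypothetical device that counts one unit per sample-expert evaluation in the E-step and one unit per sample scanned per expert in the M-step.

```python
def hard_em(samples, init_experts, iterations):
    """Hard EM with toy scalar-mean experts.

    Returns the final experts, the hard assignments, and an operation count
    that grows as 2 * I * N * K, i.e. on the order of I * N * K.
    """
    experts = list(init_experts)  # copy so the caller's list is not modified
    assignments = [0] * len(samples)
    ops = 0
    for _ in range(iterations):
        # E-step: score every sample under every expert, keep the best one.
        for i, x in enumerate(samples):
            scores = [-(x - mu) ** 2 for mu in experts]  # K evaluations
            ops += len(experts)
            assignments[i] = max(range(len(experts)), key=scores.__getitem__)
        # M-step: update each expert using only the samples assigned to it.
        for k in range(len(experts)):
            members = [x for x, a in zip(samples, assignments) if a == k]
            ops += len(samples)  # one scan over all N samples per expert
            if members:
                experts[k] = sum(members) / len(members)
    return experts, assignments, ops

samples = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
experts, assignments, ops = hard_em(samples, [0.0, 6.0], iterations=3)
print(experts, assignments, ops)
```

With N = 6, K = 2, and I = 3, the counter reaches 2 · 3 · 6 · 2 = 72, which is consistent with a per-iteration cost linear in N · K. In a real model the constant hidden by this count (a forward or backward pass through an expert) dominates the running time.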
6.3. Case Study
To more intuitively highlight the advantages of our approach and clarify the contributions of each module, we perform case studies on two challenging datasets, as illustrated in
Figure 6. Since none of these datasets include annotations for the logical types, it is impossible to directly evaluate the classification accuracy of the gate module. Therefore, we aim to assess the gate module’s classification ability indirectly through these examples. The first case pertains to the table-to-text generation task, where the gray parts represent the sentences generated by the TabMoE and the GPT-Coarse-to-Fine model. It is clear that the TabMoE’s generated sentences are both fluent and logically aligned with table-based reasoning. In contrast, most sentences generated by GPT-Coarse-to-Fine fail to draw the correct conclusions from the table, with very limited logical types. The only logical type present is the “count” type in the first sentence, but the result is also incorrect. We can directly identify the logical types of the TabMoE’s sentences as “count”, “comparative”, “superlative”, “none”, and “other”, respectively. This suggests that the experts generated sentences matching their corresponding logical types, indirectly indicating that the gate module successfully classified sentences during training. The second example, which is more straightforward, is drawn from the table-based fact verification task. The gray part represents the statements requiring verification. In this task, the gate module directly classifies the logical type of these statements, predicting the results as “superlative”, “superlative”, “count”, “superlative”, and “count”—all of which are correct. This case provides a clearer view of the classification capabilities of the gate module, proving that it achieves a relatively high level of accuracy. Furthermore, the assignment by the gate module enables the experts to better fulfill their respective roles. Our MoE-based framework differentiates the experts according to logical types, aligning naturally with the diverse operations required for table reasoning. 
This not only boosts the model’s predictive performance but also grants a degree of interpretability, a feature that is difficult for other models to achieve.
6.4. Limits
The first limitation of the TabMoE is its unsuitability for scenarios involving extremely small datasets. As the TabMoE does not rely on tabular pre-trained models, it lacks prior knowledge of the various operations conducted on tables. When the training data are insufficient, this can lead to a lack of differentiation among the experts and thus to a smaller overall performance gain. Another limitation is its suboptimal handling of large tables. Constrained by the input length of the pre-trained model, we remove some irrelevant columns to compress the table during pre-processing. However, when a table is rich in pertinent information, this compression compromises the model’s efficacy on the target task. Addressing these limitations represents a potential direction for future work to expand the TabMoE’s capabilities.
7. Conclusions
In this paper, we present the TabMoE, a general framework applicable to diverse tasks associated with logical reasoning over tables. Employing mixture-of-experts as its foundation, this framework equips each expert with proficiency in specific logical operations and facilitates their training through an unsupervised hard EM algorithm. We conduct experiments on three disparate table-related tasks: table-based question answering, table-based fact verification, and table-to-text generation. The results across these classification and generation tasks demonstrate the TabMoE’s superior performance and wide applicability. Remarkably, the TabMoE dispenses with any reliance on tabular pre-trained models, achieving outstanding results solely through task-specific datasets. Extensive analytical experiments highlight the framework’s efficacy and the functionality of its individual modules. This method not only addresses tasks related to tabular data but also offers a viable solution for tasks involving multiple data types or requiring multifaceted functionalities. In future research, we aim to tackle more difficult logical reasoning, address the decomposition of complex operations over tables, and devise strategies for combining experts to handle such complexity effectively.