TabMoE: A General Framework for Diverse Table-Based Reasoning with Mixture-of-Experts

Wu, Jie; Hou, Mengshu

doi:10.3390/math12193031

by Jie Wu and Mengshu Hou *

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

* Author to whom correspondence should be addressed.

Mathematics 2024, 12(19), 3031; https://doi.org/10.3390/math12193031

Submission received: 16 August 2024 / Revised: 21 September 2024 / Accepted: 26 September 2024 / Published: 27 September 2024

(This article belongs to the Special Issue Deep Learning for Natural Language Processing: Advances and Challenges)

Abstract: Tables serve as a widely adopted data format, attracting considerable academic interest concerning semantic understanding and logical inference of tables. In recent years, the prevailing paradigm of pre-training and fine-tuning on tabular data has become increasingly prominent in research on table understanding. However, existing table-based pre-training methods are frequently constrained: they support only single tasks while requiring substantial computational resources, which hinders their efficiency and applicability. In this paper, we introduce TabMoE, a novel framework based on mixture-of-experts, designed to handle a wide range of tasks involving logical reasoning over tabular data. Each expert within the model specializes in a distinct logical function and is trained with a hard Expectation–Maximization algorithm. Remarkably, this framework removes the dependency on tabular pre-training, instead employing only limited task-specific data to significantly enhance models’ inferential capabilities. We conduct empirical experiments across three typical tasks related to tabular data: table-based question answering, table-based fact verification, and table-to-text generation. The experimental results underscore the innovation and feasibility of our framework.

Keywords: tabular data; table understanding; table reasoning; mixture-of-experts; natural language processing

MSC: 68T50

1. Introduction

Tables are widely used tools for data visualization and analysis, organizing information systematically into rows and columns. They enable the structured representation of diverse data types, including numerical, textual, and Boolean values, conferring significant advantages for data storage, retrieval, and analysis. The inherent structural clarity and high information density of tables make them applicable across various domains, spanning from business and scientific research to everyday tasks. By providing organized data, tables enhance work efficiency and support higher quality decision-making.

Tables frequently co-occur with textual data, with both formats complementing each other within a given context. Many tasks require joint reasoning over data presented in these two formats. However, tables diverge from sequential natural language (NL) in that the information they convey is more succinct and features a two-dimensional structure. In contrast to the conventional challenges in natural language processing (NLP), research on tables introduces additional complexities: (1) The structural information within a table is critical, yet an optimal method for encoding this structure is presently lacking. (2) Table understanding necessitates not only surface-level semantic parsing but also the execution of diverse logical operations such as argmax, count, and average. (3) A semantic gap exists between NL and tables, which is particularly evident in the utilization of pre-trained models. The comprehension and application of tables depend particularly on the logical operations derived from tabular data. Logical operations are operations used for reasoning over and manipulating table contents, and they generally involve the analysis and comparison of rows or columns across various attributes. Typical examples include superlative, which identifies the maximum or minimum value within a table; counting, which calculates the number of entries that satisfy a given condition; and comparison, which relates the values of different entries. While the information presented in tables may be straightforward, the logical relationships and numerical connections between table records are not apparent, requiring advanced reasoning capabilities.
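To make the three operation types named above concrete, the following minimal Python sketch applies superlative, counting, and comparison to a toy table. Only the first row comes from the example row quoted later in the paper; the remaining rows and all function names are illustrative assumptions, not part of the TabMoE framework.

```python
# Toy table: a list of rows, each row a dict from header to cell value.
# Rows after the first are illustrative values, not taken from the paper.
table = [
    {"year": 1896, "city": "Athens", "country": "Greece", "nations": 14},
    {"year": 1900, "city": "Paris", "country": "France", "nations": 24},
    {"year": 1904, "city": "St. Louis", "country": "USA", "nations": 12},
]


def superlative(rows, column, largest=True):
    """Return the row holding the maximum (or minimum) value in `column`."""
    pick = max if largest else min
    return pick(rows, key=lambda row: row[column])


def count(rows, predicate):
    """Count the rows that satisfy `predicate`."""
    return sum(1 for row in rows if predicate(row))


def compare(rows, key_column, key_a, key_b, value_column):
    """Difference in `value_column` between the rows identified by key_a and key_b."""
    by_key = {row[key_column]: row for row in rows}
    return by_key[key_a][value_column] - by_key[key_b][value_column]
```

For instance, `superlative(table, "nations")` selects the Paris row, and `count(table, lambda r: r["nations"] > 13)` counts the rows with more than 13 nations. A neural model has to learn such behavior implicitly, which is why the paper treats each operation type as a separate reasoning skill.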

Pre-trained language models like BERT [1] and GPT [2] have exhibited notable proficiency in dealing with various free-form NL tasks. Motivated by this substantial achievement, researchers seek to extend pre-training to structured tabular data, aiming to create generalizable models that can jointly learn enhanced representations of both tabular and textual data. Most table pre-trained models focus on designing pretext tasks, encoding structural information, and synthesizing training data. However, the computational cost of table pre-training is generally much higher than that of text pre-training. Despite these efforts, table pre-trained models fail to achieve the same level of generalization as text pre-trained models and often find application limited to a specific task. For example, table pre-trained models such as TaBERT [3] and TAPAS [4] are only suitable for table-based question-answering tasks. Given the relatively modest performance gains achieved, their high resource consumption is not cost-effective.

We contend that existing text pre-trained models are sufficient for capturing the semantic information embedded within tables, so our focus is directed towards addressing logical reasoning based on tables. In human cognition, logical reasoning is typically performed by first categorizing the types of operations and then applying the corresponding functions. Moreover, intuitively, it should be much easier for a model to execute a single type of logical operation rather than managing various types simultaneously. Hence, we subdivide table operations based on logical types, employing distinct modules to emulate the execution of different logical operations, each independently yielding corresponding outcomes. In line with this conceptualization, we devise a universal framework named TabMoE, based on the mixture-of-experts (MoE) architecture, which is applicable across diverse types of table-related tasks. Specifically, the TabMoE encompasses multiple experts, each functioning as an agent for some specific types of logical reasoning, along with a gate responsible for selecting the appropriate expert. It is noteworthy that the logical types of samples in datasets are not annotated, so we cultivate the classification capability of the gate module in an unsupervised manner, by using a hard Expectation–Maximization (EM) algorithm.

We conduct experiments on three distinct tasks that involve higher-order tabular logical reasoning. Figure 1 illustrates examples of these tasks: table-based question answering, table-based fact verification, and table-to-text generation. The first task is to extract or infer answers from a table according to queries, the second is to check facts in sentences against a table, and the third is to generate sentences containing logical reasoning derived from the table content.

Our contributions are threefold:

We propose a general framework based on the mixture-of-experts (MoE) concept to address logical reasoning challenges in table understanding. This framework exhibits applicability across a variety of table-related tasks, including both classification and generation tasks.
Our framework operates independently of table pre-trained models and demonstrates competitive performance even when trained on limited datasets.
To substantiate our framework’s superiority, we conduct a series of empirical experiments that assess its effectiveness, affirm its practical applicability, and delineate the additional advantages it confers.

2. Related Work

2.1. Tabular Data Representation

Following the tremendous success of pre-trained free-text representation learning in NLP, there has been increasing interest in recent years in exploring representation learning for tabular data. The basic paradigm of pre-training involves capturing general semantic knowledge from large-scale textual data through self-supervised learning, followed by fine-tuning for various downstream tasks. Table pre-training largely adheres to this approach: it begins by gathering extensive datasets of tables along with relevant contextual information (such as headers, auxiliary data, and textual content), followed by employing the appropriate pre-training tasks aimed at learning general tabular representations to improve performance on downstream applications. Research efforts made in table pre-training primarily focus on improvements in three aspects: encoding methods, model structures, and pre-training tasks. Enhancements at the input layer typically involve the incorporation of additional positional embeddings to enhance the encoding of structural information within tables, including embeddings for rows [4], columns [5], and hierarchical structures [6]. Improvements in model structures often concentrate on optimizing attention mechanisms within transformers, including vertical attention [3], tree attention [6], and sparse attention [7]. Advancements in pre-training tasks exhibit diversified directions. Beyond drawing inspiration from tasks in text pre-training, such as masking and corrupting, there are also classification and ranking objectives tailored to specific tasks.

Owing to the complexity of table semantic parsing, the data demands for table pre-training considerably surpass those for text pre-training. Despite substantial investments into data collection and computational resources, table pre-training has consistently lagged behind text pre-training in terms of representation effectiveness and generalization capabilities. This issue of resource inefficiency is a critical concern. Additionally, current table pre-training methods typically demonstrate effectiveness only in one or two specific downstream tasks. The high resource demand and limited application scalability have sparked ongoing debates about whether table pre-training is truly the optimal way to advance table understanding. To date, no model structure or training method has emerged as a clear leader or achieved a notable breakthrough in the field of table pre-training. Balancing model performance with reduced resource consumption, while also enhancing generalization and scalability, remains a critical challenge for further exploration in table representation learning.

2.2. MoE

The concept of MoE (mixture-of-experts) is the parallel construction of multiple modular branches as experts within a single system. The core components of MoE include three parts: experts, gating, and routing. Usually, each expert functions as an independent sub-model that specializes in processing a distinct subset of input data. The gating mechanism generates weights to determine which experts need to be activated, while the routing mechanism refers to the strategy of assigning the experts and plays a crucial role in shaping the sparsity of MoE. MoE models manifest broad applicability across diverse domains, including computer vision [8], natural language processing [9], and speech recognition [10]. In the field of table understanding, SaMoE [11] employs the MoE framework with a self-adaptive method for table-based fact verification. However, SaMoE includes a necessary step where it calculates the weighted sum of all the experts’ outputs, which makes it suitable only for classification tasks that compute probabilities, like table-based fact verification. This design limits its applicability to generative tasks, like table question-answering and table-to-text generation. In contrast, our method ensures that each expert can independently perform the task, enabling its use in both classification and generative tasks.

Depending on how weights are allocated to the experts, MoE models can be categorized into soft weighting and hard gating. Traditional soft MoE methods dynamically adjust the combination of different experts using real-valued weights. However, this necessitates the execution of all the experts, which prevents a reduction in computational overhead during inference. In contrast, hard gating produces an n-dimensional vector with only a few non-zero elements, controlling the execution of the experts. Only a minority of the experts are activated during a single use, markedly reducing computational consumption and improving the efficiency of MoE. Currently, hard gating is the prevailing method in MoE and is readily integrated into various network architectures. For instance, HydraNet [12] substitutes the convolutional blocks in a CNN (convolutional neural network) with multiple experts, the sparsely gated MoE [13] incorporates MoE into LSTM (long short-term memory), and the Switch Transformer [14] integrates MoE into the transformer architecture. Recent research [15,16] has shifted from conceptualizing the experts as layers within a network to regarding them as independent and specialized modules, showcasing advancements in efficiency and composability. Further investigations [14,17] have examined the scaling properties of the experts, including their size, quantity, frequency, and behaviors during pre-training and fine-tuning in the construction of MoE models. Moreover, there has been extensive exploration into routing techniques that determine how data interact with the experts, such as deterministic hash routing [18], routing via reinforcement learning [19], and the Sinkhorn redistribution algorithm [20].
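The difference between soft weighting and hard gating can be sketched in a few lines of NumPy. This is a generic illustration, not the implementation of any cited system: the experts are arbitrary callables, and the function names and the `top_k` parameter are assumptions made for the example. Each function also returns the number of experts it executed, to make the cost difference visible.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()


def soft_moe(x, experts, gate_scores):
    """Soft weighting: run every expert and mix the outputs with real-valued weights."""
    weights = softmax(gate_scores)
    outputs = np.stack([expert(x) for expert in experts])
    return weights @ outputs, len(experts)


def hard_moe(x, experts, gate_scores, top_k=1):
    """Hard gating: run only the top_k experts; all others implicitly get weight zero."""
    weights = softmax(gate_scores)
    active = np.argsort(weights)[::-1][:top_k]
    renorm = weights[active] / weights[active].sum()
    output = sum(w * experts[i](x) for w, i in zip(renorm, active))
    return output, len(active)
```

With `top_k=1`, the output is exactly the output of the single selected expert, which is the property that lets each expert perform a complete task on its own.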

3. Problem Definition

We verify the effectiveness of our proposed framework on three typical table-related tasks: table-based fact verification, table-based question answering, and table-to-text generation. The input for the first two tasks consists of a table and text, while the input for the last is only a table. The primary symbols used throughout this paper are listed in Table 1. In all tasks, a table $T = \{t_{r, c} \mid r \leq R, c \leq C\}$ consists of $R$ rows and $C$ columns, with content $t_{r, c}$ in the $(r, c)$-th cell. The value $t_{r, c}$ can be a number, a word, or a phrase. Table $T$ may also include a title, denoted as an NL sequence $w$. In the following sections, we also represent all inputs by $x$ and all outputs by $y$, in order to describe the different tasks in a generalized manner. We formally define the tasks below:

Problem 1 (table-based question answering): Given a table $T$ and an NL question $q = q_{1}, q_{2}, \dots$, the goal is to accurately extract relevant information from table $T$ and generate an appropriate answer $y$ for the question. The process can be represented as $p(y \mid T, q)$.
Problem 2 (table-based fact verification): Given a table $T$ and an NL statement $s = s_{1}, s_{2}, \dots$, which describes a fact to be verified against the content of the table, the goal is to predict the verification label $y \in \{0, 1\}$ of the statement. If the statement $s$ is entailed by table $T$, then $y = 1$, which means that $s$ is consistent with $T$; otherwise it is refuted by $T$ and the label is $y = 0$. This is a binary classification process $p(y \mid T, s)$.
Problem 3 (table-to-text generation): Given a table $T$, the goal is to generate several sentences $y = y_{1}, y_{2}, \dots$, which are both fluent and logically (numerically) supported by table $T$. The generation process is denoted as $p(y \mid T)$.

4. Methodology

In this section, we present the details of our framework, the table mixture-of-experts network (TabMoE), designed for tabular logical reasoning. The TabMoE first processes the table and associated text, learning their embeddings by aggregating contextual information. Multiple distinct experts, each specializing in different logical types, are then utilized for logical operations on table records. The overall architecture of the TabMoE is illustrated in Figure 2. For better general expression, we summarize all tasks as

$$p_{\theta}(y \mid x), \tag{1}$$

where $x$ represents all the inputs in one task, such as the input table–question pair $(T, q)$ in the question answering task, $y$ represents the outputs, such as the binary label in fact verification, and $\theta$ denotes all the trainable parameters in the model.

We provide an in-depth discussion of our methodology in the subsequent subsections: Section 4.1 delineates the pre-processing and representation of tables and text. Section 4.2 introduces the expert modules responsible for different logical types. Section 4.3 explains the training strategy of the TabMoE and how we maintain the distinctiveness of each expert.

4.1. Table Pre-Processing and Representation

Before learning the representation of input data, we pre-process the tables. This pre-processing includes table pruning and serialization, which are needed to meet the input length limit and input format of pre-trained models. A table stores a wealth of information, but not all of it is needed in a single operation. Table pruning removes some parts of the table, leaving the more informative parts for subsequent operations. We refine the pruning algorithm introduced by Chen et al. [21]. The original method heuristically links entities in a sentence with cells in the table, then selects the columns containing matched cells. We additionally keep adding columns that were not chosen to the pruned table until the maximum input length is reached, aiming to avoid failures in retaining critical columns.
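The two-stage pruning described above might look roughly like the following sketch. It is not the algorithm of Chen et al. [21]: the entity linking is replaced by a naive case-insensitive substring match, length is measured in whitespace tokens, and the names `column_cost`, `prune_columns`, and `max_tokens` are all hypothetical.

```python
def column_cost(header, cells):
    """Approximate the serialized length of a column in whitespace tokens (an assumption)."""
    return sum(len(str(value).split()) for value in [header, *cells])


def prune_columns(headers, rows, sentence, max_tokens):
    """Keep columns linked to the sentence, then add others until the budget is reached."""
    text = sentence.lower()
    columns = {h: [row[i] for row in rows] for i, h in enumerate(headers)}

    # Stage 1: naive linking, a stand-in for the heuristic entity linking in the paper.
    kept = [
        h for h in headers
        if h.lower() in text or any(str(c).lower() in text for c in columns[h])
    ]
    used = sum(column_cost(h, columns[h]) for h in kept)

    # Stage 2: add unmatched columns in table order until the next one would not fit.
    for h in headers:
        if h in kept:
            continue
        cost = column_cost(h, columns[h])
        if used + cost > max_tokens:
            break
        kept.append(h)
        used += cost

    # Return the surviving headers in their original table order.
    return [h for h in headers if h in kept]
```

The second stage is the refinement that matters here: if the linking step misses a column that the reasoning needs, filling the remaining budget gives that column a chance to survive.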

Owing to the remarkable proficiency exhibited by pre-trained transformers in the semantic understanding of textual data, we employ them to encode tabular data, capturing their inherent semantic features. However, a table with a two-dimensional structure cannot be directly processed by pre-trained models, which require a one-dimensional input sequence. Prior research typically employs templates or special delimiters to linearize the table into an extended sequence through horizontal or vertical scanning. In our approach, we concatenate the value of each cell with its respective header for each row in the table. For instance, considering the table in Figure 1, the first row will be represented as follows: “in row 1, year: 1896 | city: Athens | country: Greece | nations: 14”. We then obtain the serialized table $T^{*}$ by concatenating all serialized rows horizontally. For table-based question answering and fact verification, the final input sequence is formed as $x = [\texttt{[CLS]}, q / s, \texttt{[SEP]}, w, T^{*}]$. For table-to-text generation, the sequence is $x = [\texttt{[CLS]}, w, T^{*}]$. In conformity with customary practices in pre-trained models, special tokens are inserted into the sequence $x$ to meet the input format. Although different types of logical reasoning compete for parameters, the input data still share fundamental semantic features. Therefore, a shared bottom encoder, composed of several transformer layers, is deployed to capture low-level contextual semantics:

$$H = f_{L}(x). \tag{2}$$

For each input sequence $x$, the bottom encoder $f_{L}$ encodes it into contextualized embeddings $H \in \mathbb{R}^{l \times d}$, where $l$ is the maximum length of the input sequence and $d$ is the dimension of the hidden vector. Subsequently, the latent representations go through the following specialized experts to extract high-level logical semantics for reasoning.
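The row-wise serialization can be sketched as follows. The row template reproduces the example string quoted in the text; everything else is an assumption for illustration: rows are joined with a single space (the paper does not state the separator), `[CLS]` and `[SEP]` are written as literal strings where a real tokenizer would insert its own special tokens, and the function names are hypothetical.

```python
def serialize_row(index, headers, cells):
    """Render one row as 'in row i, header: value | header: value | ...'."""
    pairs = " | ".join(f"{header}: {cell}" for header, cell in zip(headers, cells))
    return f"in row {index}, {pairs}"


def serialize_table(headers, rows):
    """Concatenate all serialized rows into one sequence (space-joined by assumption)."""
    return " ".join(
        serialize_row(i, headers, row) for i, row in enumerate(rows, start=1)
    )


def build_input(headers, rows, title="", text=None):
    """Assemble x = [[CLS], q/s, [SEP], w, T*]; omit q/s and [SEP] for table-to-text."""
    parts = ["[CLS]"]
    if text is not None:
        parts += [text, "[SEP]"]
    if title:
        parts.append(title)
    parts.append(serialize_table(headers, rows))
    return " ".join(parts)
```

Pairing every cell with its header makes each row self-describing, so the encoder does not need extra row or column embeddings to recover which attribute a value belongs to.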

4.2. Experts

In the TabMoE, each expert is in charge of a distinct subset of data based on its unique and specialized functionality. As illustrated in Figure 2, we build a set of modules with $K$ experts. These independent experts systematically decompose an overall task into discrete subtasks by logical types and simulate the corresponding reasoning processes. It is noteworthy that all the experts operating within this framework share the same input tensor and network structure. The structure of the experts can be flexibly adjusted to align with the specific requirements of a given task. The feature embeddings $H$ obtained in Section 4.1 are input, without distinction, into the activated experts. Here, for classification tasks like fact verification, each expert consists of transformer layers $f_{e_{i}}$ and perceptron layers $f_{\mathrm{mlp}_{e_{i}}}$:

$$h_{e_{i}} = f_{e_{i}}(H), \tag{3}$$

$$y_{i} = \sigma(f_{\mathrm{mlp}_{e_{i}}}(h_{e_{i}})), \tag{4}$$

where $i$ is the index of the expert, $h_{e_{i}} \in \mathbb{R}^{d}$ is the hidden vector of the first token in the sequence, $f_{\mathrm{mlp}_{e_{i}}}$ is a two-layer perceptron, $\sigma$ is the softmax function, and $y_{i}$ is the classification prediction of the $i$-th expert. As for generation tasks such as question answering and table-to-text generation, the expert $d_{e_{i}}$ functions as a decoder, which is also composed of transformer layers:

$$y_{i} = d_{e_{i}}(H), \tag{5}$$

where $y_{i}$ is the text generated by the $i$-th expert.

The experts themselves are neutral and undirected, and certain tasks, such as question answering, usually require only one expert. Therefore, we add a gating module to route data tensors towards a particular expert based on data features. Here, we employ several transformer layers $f_{G}$ and perceptron layers $f_{\mathrm{mlp}_{G}}$ as the gating module:

$$h_{G} = f_{G}(H), \tag{6}$$

$$d_{G} = \sigma(f_{\mathrm{mlp}_{G}}(h_{G})), \tag{7}$$

$$t = \underset{k}{\operatorname{argmax}}\; d_{G}[k], \tag{8}$$

where $h_{G} \in \mathbb{R}^{d}$ is the hidden vector of the first token in the sequence, $f_{\mathrm{mlp}_{G}}$ is a two-layer perceptron, $d_{G} \in \mathbb{R}^{K}$ is the probability distribution over logical types predicted by the gating module, and $K$ is the number of experts. Depending on the score vector $d_{G}$, the $t$-th expert is selected as the reasoning expert for the target logical type. However, there are no annotated labels for logical types within these datasets, so we use the unsupervised EM algorithm for the training process, as elucidated in Section 4.3.
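Equations (7) and (8) can be sketched in NumPy as below, taking the first-token vector $h_{G}$ as given (the transformer layers $f_{G}$ of Equation (6) are omitted). The weight shapes, the `tanh` activation inside the perceptron, and the function names are assumptions; the paper only states that the perceptron has two layers and is followed by a softmax.

```python
import numpy as np


def gate_probabilities(h_gate, w1, b1, w2, b2):
    """Equation (7): two-layer perceptron on h_G, followed by a softmax over K experts."""
    hidden = np.tanh(h_gate @ w1 + b1)  # the activation choice is an assumption
    logits = hidden @ w2 + b2
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()


def select_expert(d_gate):
    """Equation (8): index t of the most probable logical type."""
    return int(np.argmax(d_gate))
```

Because only the selected expert runs, the gate's decision is discrete, which is what motivates the hard EM training described in Section 4.3.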

4.3. Training

The training of a multi-expert model is different from traditional model training paradigms. In our framework, we need multi-expert training to guide each expert in learning a particular type of logical reasoning while maintaining balanced training across all experts. This necessitates that each expert’s parameters be trained using pertinent types of data, ensuring that all the experts within the model can be fully trained. However, explicit logical type annotations for data are not available in the tasks addressed by our framework. Therefore, we employ unsupervised training by introducing a latent variable $z \in \{1, \dots, K\}$ as an expert’s index. For the simplified model in Equation (1), which includes the source input $x$, target label $y$, and parameters $\theta$, the marginal likelihood can be decomposed as follows:

$$p_{\theta}(y \mid x) = \sum_{z = 1}^{K} p_{\theta}(z \mid x)\, p_{\theta}(y \mid z, x), \tag{9}$$

where $p_{\theta}(z \mid x)$ is the prior probability, and $p_{\theta}(y \mid z, x)$ is the likelihood. Each expert takes responsibility for the sample observation $(x, y)$ through the posterior probability:

$$p_{\theta}(z \mid x, y) = \frac{p_{\theta}(z \mid x)\, p_{\theta}(y \mid z, x)}{\sum_{z'} p_{\theta}(z' \mid x)\, p_{\theta}(y \mid z', x)}. \tag{10}$$
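A small numeric sketch may help in reading Equations (9) and (10). The prior and likelihood values below are made up for illustration, and the function name is hypothetical.

```python
import numpy as np


def marginal_and_posterior(prior, likelihood):
    """Return p(y|x) from Equation (9) and p(z|x,y) from Equation (10)."""
    joint = prior * likelihood      # p(z|x) * p(y|z,x) for each expert z
    marginal = joint.sum()          # Equation (9): sum over experts
    posterior = joint / marginal    # Equation (10): normalize the joint
    return marginal, posterior


# Illustrative values for K = 3 experts (not from the paper).
prior = np.array([0.5, 0.3, 0.2])        # gate output p(z|x)
likelihood = np.array([0.1, 0.6, 0.3])   # expert outputs p(y|z,x)
marginal, posterior = marginal_and_posterior(prior, likelihood)
```

In this example the gate prefers the first expert, but the second expert explains the observed output best, so the posterior shifts responsibility to it. This gap between prior and posterior is the signal that trains the gate without logical-type labels.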

For classification tasks, the loss function is cross-entropy loss. For generation tasks, the loss utilized for auto-regression is also cross-entropy loss; both are denoted as follows:

$$L = \mathbb{E}[-\log p_{\theta}(y \mid x)]. \tag{11}$$

After introducing the latent variable $z$, the gradient of the log likelihood can be decomposed as follows:

$$\nabla \log p_{\theta}(y \mid x) \approx \sum_{z = 1}^{K} p_{\theta}(z \mid x, y) \cdot \nabla \log p_{\theta}(y, z \mid x). \tag{12}$$

Typically, the EM algorithm is suitable for training models in this context. However, our model has a distinctive characteristic: each expert within the model is assigned operations belonging to a distinct logical category, which requires that each sample be allocated to only one expert during the training process. Given this setup, we utilize the hard EM algorithm. While the standard EM algorithm iteratively approximates the maximization of the objective function over all experts, the hard EM algorithm selects only the most probable expert for updating in each iteration. This strategy enhances the framework’s learning efficiency, meeting the training needs of our gate module. Moreover, it reduces the computational complexity, significantly lowers the GPU memory consumption, and accelerates convergence. The whole training process iterates the following two steps until convergence is reached:

E-step: given the current parameters $\theta$, estimate the probabilities of each expert and identify the target expert by $\operatorname{argmax}_{z}\, p_{\theta}(y, z \mid x)$.
M-step: based on the expert selected in the E-step, update the parameters $\theta$ by maximizing the likelihood.
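The two steps above can be illustrated on a deliberately simplified toy mixture. This is not the TabMoE training loop: here each "expert" predicts a single constant with a unit-variance Gaussian likelihood, the "gate" is an input-independent prior, and the M-step uses closed-form updates instead of gradient steps on a neural network. All names are hypothetical.

```python
import numpy as np


def hard_em(y, means, priors, n_iters=10):
    """Hard EM on a toy mixture where expert z predicts the constant means[z]."""
    y = np.asarray(y, dtype=float)
    means = np.asarray(means, dtype=float).copy()
    priors = np.asarray(priors, dtype=float).copy()
    assign = None
    for _ in range(n_iters):
        # E-step: z* = argmax_z [log p(z) + log p(y|z)], dropping shared constants.
        log_joint = np.log(priors)[None, :] - 0.5 * (y[:, None] - means[None, :]) ** 2
        new_assign = log_joint.argmax(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments are stable, so the parameters would not change
        assign = new_assign
        # M-step: each expert is updated only with the samples routed to it.
        for z in range(len(means)):
            mask = assign == z
            if mask.any():
                means[z] = y[mask].mean()
        # Prior update with add-one smoothing (an assumption) to avoid log(0).
        counts = np.bincount(assign, minlength=len(means)) + 1.0
        priors = counts / counts.sum()
    return means, priors, assign
```

The key property carried over from the text is that every sample updates exactly one expert per iteration, so experts drift toward distinct subsets of the data rather than all fitting the same average.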

5. Experiment Setup

In this section, we present the datasets, evaluation metrics, and implementations of experiments on three distinct types of tasks associated with logical reasoning.

5.1. Datasets and Evaluations

We select three widely used representative datasets of these tasks, including weakly supervised WikiSQL [22], TabFact [21], and LogicNLG [23], corresponding to table-based question answering, fact verification, and table-to-text generation, respectively. Table 2 shows the basic statistics of these datasets. Note that in all the datasets, the training, validation, and test sets have no overlap in the tables. The evaluation metrics for each dataset are outlined below:

WikiSQL: It is the largest crowd-sourced dataset focused on logical forms, designed to develop natural language interfaces for relational databases. Its designated evaluation metric is denotation accuracy, which checks the congruence between the predicted answer and the reference answer. We apply the evaluation implementation provided by TAPAS [4].
TabFact: It is a large-scale dataset consisting of 16k Wikipedia tables used as evidence for 118k human-annotated NL statements, which require both soft linguistic reasoning and hard symbolic reasoning. The evaluation metric for TabFact is accuracy, computed as the proportion of correct predictions. The test set is further split into simple and complex channels, where simple statements only refer to a single row or record, and complex ones require higher-order semantic operations like argmax and count. On a small test set containing about 2k samples, human evaluation reaches an accuracy of 92.1%.
LogicNLG: Unlike previous data-to-text datasets which purely focus on sequence generation, LogicNLG specifically emphasizes inferences involving symbolic operations over the given table. Its evaluation metrics include BLEU for assessing surface-level consistency, along with the accuracy of semantic parsing (SP-Acc) and accuracy of natural language inference (NLI-Acc) for assessing logic-level fidelity. It is worth noting that SP-Acc and NLI-Acc are based on previous models, whose accuracies are merely satisfactory, implying that these two metrics lack high precision and cannot serve as gold-standard measurements.

5.2. Experimental Settings

All the experts in the TabMoE share an identical structure and initialization. We initialize the encoders and experts in the TabMoE with pre-trained parameters from BART [24], and the remaining parameters are initialized randomly. BART is a language model pre-trained on free-form text with an encoder–decoder architecture, in which the encoder and decoder have the same number of transformer layers. Specifically, the TabMoE’s bottom encoder is initialized with BART’s encoder, and all the experts are initialized with BART’s decoder. To ensure fair comparisons with baseline models of comparable parameter counts, we employ BART-large (24 layers) for question answering and fact verification, and BART-base (12 layers) for table-to-text generation.

We set $K = 5$ experts for fact verification and table-to-text generation, and $K = 4$ for question answering. Our implementations are based on PyTorch and the Transformers library [25,26]. The transformer parameter settings align with the default configurations of BART. All experiments are conducted on two Titan RTX GPUs, using a batch size of 16 for the large-size model and 48 for the base-size model. We employ the AdamW optimizer with a learning rate ranging from 10⁻⁵ to 5 × 10⁻⁵, evaluating every 2000 steps and saving the checkpoint with the best performance on the validation set. All models are trained for at most 15 epochs.

6. Results

In this section, we evaluate the experimental results and engage in an in-depth discussion concerning the TabMoE’s performance across three table-related tasks to verify its effectiveness and potential advantages.

6.1. Overall Performance

Comparisons of the TabMoE with baselines on WikiSQL, TabFact, and LogicNLG are summarized in Table 3, Table 4, and Table 5, respectively. We report the performance of the TabMoE based on five random runs.

Table-based question answering. As shown in Table 3, the TabMoE exhibits notably superior performance when contrasted with all baseline models on the widely used WikiSQL dataset. Its efficacy is evident through the attainment of an 89.6% accuracy on the validation set and an 89.2% accuracy on the test set. This denotes a 3.0% improvement over BART and a notable 2.8% enhancement over the large table pre-trained model, TAPAS, specifically on the test set. These results are particularly significant given that TAPAS is a pre-trained model explicitly designed for table-based question answering, with extensive pre-training on a large corpus of 6.2 M tables, providing it with substantial prior knowledge of tables. Despite this advantageous background, the TabMoE surpasses TAPAS by solely leveraging tabular data within the dataset, highlighting its ability to outperform a specialized table pre-trained model while relying on a more limited information source.
Table-based fact verification. Table 4 presents the performance of diverse models on distinct subsets of TabFact. The TabMoE achieves accuracies of 84.7% and 84.6% on the validation and test sets, respectively, which surpasses TAPAS by 3.4% and Decomp by 1.9%. The poor performance of the large-scale table pre-trained model TAPAS indicates the limitations of table pre-training, which often yields proficiency in pre-training tasks but lacks generalizability to other tasks. Notably, the improvement of the TabMoE is most prominent in the complex channel, showing its heightened capacity for complex reasoning on tables compared to the previous models. This gain reflects the TabMoE’s proficiency in acquiring logic-level semantics and performing more complex inference.
Table-to-text generation. The results of the TabMoE and baselines on LogicNLG are presented in Table 5, encompassing assessments of both surface-level and logic-level fidelity. The TabMoE improves surface-level fidelity over the baselines, while maintaining a competitive performance in logic-level fidelity. As mentioned in Section 5.1, given the unreliability of SP-Acc and NLI-Acc in the table-to-text task, the most valued evaluation metrics are BLEU-1/2/3. The TabMoE attains scores of 54.4/33.3/19.3, surpassing R2D2 by 2.6/1.0/0.7, respectively. Considering that R2D2 is a pre-trained model specifically designed for the table-to-text task, the results underscore the TabMoE’s capacity to yield superior outcomes in generation tasks even with limited training data, outperforming specialized pre-trained models.

The aforementioned results on three tasks, which outperform various baseline models, showcase the overall strength and generalization of our framework on both classification and generation tasks. Our model eliminates the need for extensive table-specific pre-training on large amounts of related data, yet it can still outperform pre-trained models when trained directly on a target dataset. This highlights the TabMoE’s substantial resource efficiency and its superior effectiveness in solving table-based reasoning tasks. Additionally, this approach underscores the advantage of identifying logical types and assigning specialized experts to handle distinct logical operations and reasoning, which aligns more closely with the problem-solving framework for table-based reasoning. Further details and insights into these findings will be explored through a series of diverse experiments in subsequent sections.

6.2. Analysis

We offer a detailed presentation and analysis of the impact of model architecture and training data on performance outcomes. The experimental analysis covers the influence of the number of experts, the impact of logical type, and the differentiation of the experts. Furthermore, we conduct a case study and explore potential advantages.

Number of experts. To evaluate the influence of the gate module and the number of experts on model performance, we carry out experiments with different numbers of experts on the three tasks. The results are depicted in Figure 3, and the three datasets show broadly similar trends. Setting the number of experts to $K = 1$ is equivalent to removing the gate module, so this setting serves as an ablation; the comparison clearly demonstrates that introducing the gate module leads to significant performance improvements across these tasks. The improvement is particularly evident in simpler tasks, while in more complex tasks, the gate module needs to be combined with suitable experts to fully realize its potential. Overall, performance rises markedly as the number of experts increases, but after a certain threshold it plateaus and then either oscillates or declines. The optimal number of experts for WikiSQL is smaller than for the other two datasets, possibly because WikiSQL places lower demands on logical reasoning ability. The observed performance decline can largely be attributed to limitations in the dataset’s quality and quantity: finer subdivisions may leave each expert with insufficient training data, impeding adequate training of both the gate module and the experts. Additionally, increasing the number of experts results in a rapid increase in model parameters, so a suitable number must be identified. The number of experts should therefore be chosen based on the difficulty and quality of the dataset, so as to preserve the gate module’s discriminative efficacy and the experts’ inferential capabilities.
Performance of each expert. We analyze each expert independently by evaluating its performance on the data assigned to it by the gate module, aiming to assess how effectively it adapts to various logical operations. For this analysis, we utilize TabFact due to its rich variety of logical types and its provision of simple and complex channels. Based on the probabilities from the gate module, the dataset is divided into five subsets. Although the classification by the gate module may not be entirely precise, it still yields a rough categorization of samples, allowing the influence of logical types to be discerned. The logical type of each subset is inferred from the key trigger words provided in TabFact [21]; this labeling is not entirely accurate and serves only as a rough classification into the following categories: “count”, “comparative”, “superlative”, “negation” and “other”. In Figure 4, the accuracy of each expert on the corresponding test subset is compared across three channels. The fact that the experts specializing in “comparative” and “negation” operations attain higher accuracy in the complex channel suggests their proficiency in table cell selection. The performance declines slightly for the “count” and “superlative” operations, hinting at the increased difficulty of mastering reasoning associated with table aggregation. The imbalanced performance could be linked to the unequal distribution of logical types in the training data, highlighting a potential focus for future improvements.
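The per-expert evaluation described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `per_expert_accuracy` and the plain-list inputs are hypothetical and not taken from the TabMoE implementation. Each sample is routed to the expert with the highest gate probability, and accuracy is then computed separately within each routed subset.

```python
from collections import defaultdict


def per_expert_accuracy(gate_probs, expert_preds, labels):
    """Accuracy of each expert on the samples the gate routes to it.

    gate_probs[i][k]   -- gate probability of expert k for sample i (assumed input)
    expert_preds[i][k] -- label predicted by expert k for sample i (assumed input)
    labels[i]          -- gold label of sample i
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for probs, preds, gold in zip(gate_probs, expert_preds, labels):
        # Hard routing: pick the expert with the highest gate probability.
        k = max(range(len(probs)), key=probs.__getitem__)
        total[k] += 1
        correct[k] += int(preds[k] == gold)
    # Experts that receive no samples are simply absent from the result.
    return {k: correct[k] / total[k] for k in total}


# Toy data: 4 samples, 2 experts.
toy_gate = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_preds = [[1, 0], [0, 1], [0, 0], [1, 0]]
toy_labels = [1, 1, 1, 1]
toy_result = per_expert_accuracy(toy_gate, toy_preds, toy_labels)
```

In the toy data, each expert receives two samples and gets one of them right, so both accuracies are 0.5. Restricting the inputs to the simple or complex channel before calling the function would give a channel-wise breakdown of the kind compared in Figure 4.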
Differentiation among experts. The differentiation of the experts is necessary for performing different logical operations in the TabMoE. To assess this, we evaluate the outputs of individual experts. We first test each expert’s ability to handle all logical types of data within TabFact. This experiment removes the gate module, with each expert tasked to analyze every sample in the test set. For each sample, the count of experts predicting correctly is recorded. Figure 5 plots, for each count of experts k, the proportion of test samples correctly predicted by at least k experts. As k increases, this proportion gradually decreases. This trend suggests that certain data require specific expert reasoning for accurate results, thereby highlighting the variances among the experts. Additionally, we indirectly verify the differentiation of the experts by measuring the diversity of generated sentences on the LogicNLG dataset. The self-BLEU metric [42] is employed for this purpose, which replaces the reference text in the conventional BLEU metric with the generated text itself. Self-BLEU is computed by calculating the BLEU score of each sentence against the rest of the sentences in a generated collection and then averaging the BLEU scores of all generated sentences. A lower self-BLEU score corresponds to a higher diversity of generated outputs. The results, as presented in Table 6, demonstrate that the text generated by the TabMoE displays significantly better diversity compared to other models (we only compare diversity among models with available source codes). This observation underscores the capacity of the experts to generate more diverse text, thereby supporting the distinctions among the experts. This feature also confers an additional advantage on generative tasks by enhancing the diversity of generated outputs, an aspect often overlooked by other models.
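To make the self-BLEU computation concrete, the following Python sketch scores each generated sentence against all other generated sentences and averages the results. It is a simplified illustration, not the Texygen implementation used in [42]: it assumes whitespace tokenization, uniform weights over n-gram orders 1 to `max_n`, and no smoothing (a sentence with zero overlap at any order scores 0), so absolute values will differ from those in Table 6. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyp, refs, max_n):
    """Unsmoothed BLEU of one hypothesis against several references."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        if not hyp_counts:
            return 0.0  # hypothesis shorter than n tokens
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision += math.log(clipped / sum(hyp_counts.values())) / max_n
    # Brevity penalty against the reference length closest to the hypothesis.
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(hyp)), l))
    penalty = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / len(hyp))
    return penalty * math.exp(log_precision)


def self_bleu(sentences, max_n):
    """Average BLEU of each sentence against the rest of the collection."""
    if len(sentences) < 2:
        raise ValueError("self-BLEU needs at least two sentences")
    tokenized = [s.split() for s in sentences]
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)
```

A collection of near-duplicate sentences yields a score close to 1, while sentences that share no vocabulary yield 0, which matches the reading that a lower self-BLEU indicates higher diversity.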
Time complexity. When calculating the time complexity of our algorithm, the key variables to consider include the number of samples $N$, the number of logical types $K$ (i.e., the number of experts), and the number of iterations $I$ of the algorithm. In our model, the feature dimension is a fixed constant and can therefore be excluded from complexity analysis. During the E-step of the hard EM algorithm, the most probable logical type for each sample needs to be identified. This requires computing the probability of each sample for every logical type and then selecting the type with the highest probability, resulting in a time complexity of $O(NK)$. In the M-step, the parameters of the expert corresponding to the identified logical type are updated. For $K$ logical types, all samples need to be scanned, leading to a time complexity of $O(KN)$. Combining these two steps, the time complexity for each iteration is $O(NK + KN) = O(NK)$. When the algorithm converges after $I$ iterations, the overall time complexity is $O(INK)$.
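The operation count above can be illustrated with a toy hard EM loop in Python. This is a sketch under strong simplifying assumptions and not the TabMoE training procedure: each "expert" is a single scalar slope predicting $y = a_k x$, the E-step assigns every sample to the expert with the smallest squared error (standing in for the highest probability), and the M-step refits each expert by least squares on its assigned samples. The name `hard_em` and the counter `e_step_evals` are hypothetical and exist only to expose the $N \times K$ evaluations per iteration.

```python
def hard_em(xs, ys, init_slopes, iterations):
    """Toy hard EM with K scalar-slope experts; returns slopes, assignments, E-step count."""
    n, k = len(xs), len(init_slopes)
    slopes = list(init_slopes)
    assign = [0] * n
    e_step_evals = 0
    for _ in range(iterations):
        # E-step: score every sample under every expert (N * K evaluations)
        # and keep the best-fitting expert as the hard assignment.
        for i in range(n):
            losses = []
            for a in slopes:
                losses.append((ys[i] - a * xs[i]) ** 2)
                e_step_evals += 1
            assign[i] = min(range(k), key=losses.__getitem__)
        # M-step: for each of the K experts, scan all N samples and refit
        # the slope on the samples currently assigned to that expert.
        for j in range(k):
            num = sum(xs[i] * ys[i] for i in range(n) if assign[i] == j)
            den = sum(xs[i] ** 2 for i in range(n) if assign[i] == j)
            if den > 0:
                slopes[j] = num / den
    return slopes, assign, e_step_evals


# Toy data: 20 samples drawn from two linear "logical types" (slopes 2 and -3).
toy_xs = list(range(1, 11)) * 2
toy_ys = [2 * x for x in toy_xs[:10]] + [-3 * x for x in toy_xs[10:]]
toy_slopes, toy_assign, toy_evals = hard_em(toy_xs, toy_ys, [1.0, -1.0], iterations=3)
# 3 iterations * 20 samples * 2 experts = 120 E-step evaluations
```

With $I = 3$, $N = 20$, and $K = 2$, the counter reaches $I \cdot N \cdot K = 120$, in line with the $O(INK)$ bound. In the actual model each evaluation is a forward pass through an expert, so the constant factor is far larger than in this toy.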

6.3. Case Study

To highlight the advantages of our approach more intuitively and clarify the contributions of each module, we perform case studies on two challenging datasets, as illustrated in Figure 6. Since neither dataset includes annotations for the logical types, it is impossible to directly evaluate the classification accuracy of the gate module. Therefore, we aim to assess the gate module’s classification ability indirectly through these examples. The first case pertains to the table-to-text generation task, where the gray parts represent the sentences generated by the TabMoE and the GPT-Coarse-to-Fine model. The TabMoE’s generated sentences are both fluent and logically aligned with table-based reasoning. In contrast, most sentences generated by GPT-Coarse-to-Fine fail to draw the correct conclusions from the table and cover very few logical types. The only logical type present is the “count” type in the first sentence, and its result is also incorrect. We can directly identify the logical types of the TabMoE’s sentences as “count”, “comparative”, “superlative”, “none”, and “other”, respectively. This suggests that the experts generated sentences matching their corresponding logical types, indirectly indicating that the gate module successfully classified sentences during training. The second example, which is more straightforward, is drawn from the table-based fact verification task. The gray part represents the statements requiring verification. In this task, the gate module directly classifies the logical type of these statements, predicting the results as “superlative”, “superlative”, “count”, “superlative”, and “count”, all of which are correct. This case provides a clearer view of the classification capabilities of the gate module, suggesting that it achieves a relatively high level of accuracy. Furthermore, the assignment by the gate module enables the experts to better fulfill their respective roles.
Our MoE-based framework differentiates the experts according to logical types, aligning naturally with the diverse operations required for table reasoning. This not only boosts the model’s predictive performance but also grants a degree of interpretability, a feature that is difficult for other models to achieve.

6.4. Limits

The first limitation of the TabMoE resides in its unsuitability for scenarios involving extremely small datasets. As the TabMoE does not rely on tabular pre-trained models, it lacks prior knowledge of the various operations conducted on tables. In cases where the training data are insufficient, this can lead to a lack of differentiation among the experts and thus a smaller overall performance gain. Another limitation is its suboptimal handling of large tables. Constrained by the input length of the pre-trained model, we remove some irrelevant columns to compress the table during the pre-processing step. However, when a table is abundant in pertinent information, this compression discards useful content and compromises the model’s efficacy on target tasks. Addressing these limitations represents a potential direction for future work to expand the TabMoE’s capabilities.

7. Conclusions

In this paper, we present the TabMoE, a general framework applicable to diverse tasks associated with logical reasoning over tables. Employing mixture-of-experts as its foundation, this framework equips each expert with proficiency in specific logical operations and facilitates their training through an unsupervised hard EM algorithm. We conduct experiments on three disparate table-related tasks: table-based question answering, table-based fact verification, and table-to-text generation. The results across these classification and generation tasks demonstrate the TabMoE’s superior performance and wide applicability. Remarkably, the TabMoE dispenses with any reliance on tabular pre-trained models, achieving outstanding results solely through task-specific datasets. Extensive analytical experiments highlight the framework’s efficacy and the functionality of its individual modules. This method not only resolves tasks related to tabular data but also offers a viable solution for tasks involving multiple data types or requiring multifaceted functionalities. In future research, we aim to tackle more difficult logical reasoning, address the decomposition of complex operations over tables, and devise strategies for combining experts to handle intricacies effectively.

Author Contributions

Conceptualization, J.W.; methodology, J.W.; software, J.W.; writing—original draft preparation, J.W.; writing—review and editing, M.H.; supervision, M.H.; funding acquisition, M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 62072075.

Data Availability Statement

All data that support the findings of this study are cited and publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
2. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
3. Yin, P.; Neubig, G.; Yih, W.T.; Riedel, S. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8413–8426.
4. Herzig, J.; Nowak, P.K.; Müller, T.; Piccinno, F.; Eisenschlos, J. TaPas: Weakly Supervised Table Parsing via Pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4320–4333.
5. Iida, H.; Thai, D.; Manjunatha, V.; Iyyer, M. TABBIE: Pretrained Representations of Tabular Data. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 3446–3456.
6. Wang, Z.; Dong, H.; Jia, R.; Li, J.; Fu, Z.; Han, S.; Zhang, D. TUTA: Tree-based Transformers for Generally Structured Table Pre-training. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event, Singapore, 14–18 August 2021; pp. 1780–1790.
7. Eisenschlos, J.; Gor, M.; Müller, T.; Cohen, W. MATE: Multi-view Attention for Table Transformer Efficiency. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7606–7619.
8. Puigcerver, J.; Ruiz, C.R.; Mustafa, B.; Renggli, C.; Pinto, A.S.; Gelly, S.; Keysers, D.; Houlsby, N. Scalable Transfer Learning with Expert Models. In Proceedings of the International Conference on Learning Representations, Online, 19–21 February 2021.
9. Ravaut, M.; Joty, S.; Chen, N. SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 4504–4524.
10. You, Z.; Feng, S.; Su, D.; Yu, D. Speechmoe2: Mixture-of-Experts Model with Improved Routing. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7217–7221.
11. Zhou, Y.; Liu, X.; Zhou, K.; Wu, J. Table-based Fact Verification with Self-adaptive Mixture of Experts. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 139–149.
12. Mullapudi, R.T.; Mark, W.R.; Shazeer, N.; Fatahalian, K. Hydranets: Specialized dynamic architectures for efficient inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8080–8089.
13. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
14. Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 2022, 23, 1–39.
15. Gururangan, S.; Lewis, M.; Holtzman, A.; Smith, N.A.; Zettlemoyer, L. DEMix Layers: Disentangling Domains for Modular Language Modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 5557–5576.
16. Li, M.; Gururangan, S.; Dettmers, T.; Lewis, M.; Althoff, T.; Smith, N.A.; Zettlemoyer, L. Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models. In Proceedings of the First Workshop on Interpolation Regularizers and Beyond at NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022.
17. Du, N.; Huang, Y.; Dai, A.M.; Tong, S.; Lepikhin, D.; Xu, Y.; Krikun, M.; Zhou, Y.; Yu, A.W.; Firat, O.; et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. In Proceedings of the 39th International Conference on Machine Learning, Online, 17–23 July 2022; Volume 162, pp. 5547–5569.
18. Roller, S.; Sukhbaatar, S.; Szlam, A.; Weston, J. Hash Layers For Large Sparse Models. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 17555–17566.
19. Clark, A.; De Las Casas, D.; Guy, A.; Mensch, A.; Paganini, M.; Hoffmann, J.; Damoc, B.; Hechtman, B.; Cai, T.; Borgeaud, S.; et al. Unified Scaling Laws for Routed Language Models. In Proceedings of the 39th International Conference on Machine Learning, Online, 17–23 July 2022; Volume 162, pp. 4057–4086.
20. Lewis, M.; Bhosale, S.; Dettmers, T.; Goyal, N.; Zettlemoyer, L. BASE Layers: Simplifying Training of Large, Sparse Models. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; Volume 139, pp. 6265–6274.
21. Chen, W.; Wang, H.; Chen, J.; Zhang, Y.; Wang, H.; Li, S.; Zhou, X.; Wang, W.Y. TabFact: A Large-scale Dataset for Table-based Fact Verification. In Proceedings of the International Conference on Learning Representations, Online, 26–30 April 2020.
22. Zhong, V.; Xiong, C.; Socher, R. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv 2017, arXiv:1709.00103.
23. Chen, W.; Chen, J.; Su, Y.; Chen, Z.; Wang, W.Y. Logical Natural Language Generation from Open-Domain Tables. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7929–7942.
24. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880.
25. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
26. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45.
27. Liang, C.; Norouzi, M.; Berant, J.; Le, Q.; Lao, N. Memory augmented policy optimization for program synthesis and semantic parsing. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 10015–10027.
28. Agarwal, R.; Liang, C.; Schuurmans, D.; Norouzi, M. Learning to Generalize from Sparse and Underspecified Rewards. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 130–140.
29. Wang, B.; Titov, I.; Lapata, M. Learning Semantic Parsers from Denotations with Latent Structured Alignments and Abstract Programs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3774–3785.
30. Min, S.; Chen, D.; Hajishirzi, H.; Zettlemoyer, L. A Discrete Hard EM Approach for Weakly Supervised Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2851–2864.
31. Yu, T.; Wu, C.S.; Lin, X.V.; Wang, B.; Tan, Y.C.; Yang, X.; Radev, D.; Socher, R.; Xiong, C. GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing. In Proceedings of the International Conference on Learning Representations, Online, 3–7 May 2021.
32. Zhong, W.; Tang, D.; Feng, Z.; Duan, N.; Zhou, M.; Gong, M.; Shou, L.; Jiang, D.; Wang, J.; Yin, J. LogicalFactChecker: Leveraging Logical Operations for Fact Checking with Graph Module Network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6053–6065.
33. Shi, Q.; Zhang, Y.; Yin, Q.; Liu, T. Learn to Combine Linguistic and Symbolic Information for Table-based Fact Verification. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), 8–13 December 2020; pp. 5335–5346.
34. Zhang, H.; Wang, Y.; Wang, S.; Cao, X.; Zhang, F.; Wang, Z. Table Fact Verification with Structure-Aware Transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1624–1629.
35. Yang, X.; Nie, F.; Feng, Y.; Liu, Q.; Chen, Z.; Zhu, X. Program Enhanced Fact Verification with Verbalization and Graph Attention Network. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 7810–7825.
36. Dong, R.; Smith, D. Structural Encoding and Pre-training Matter: Adapting BERT for Table-Based Fact Verification. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 2366–2375.
37. Eisenschlos, J.; Krichene, S.; Müller, T. Understanding tables with intermediate pre-training. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 281–296.
38. Yang, X.; Zhu, X. Exploring Decomposition for Table-based Fact Verification. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 1045–1052.
39. Zhao, Y.; Qi, Z.; Nan, L.; Flores, L.J.; Radev, D. LoFT: Enhancing Faithfulness and Diversity for Table-to-Text Generation via Logic Form Control. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; pp. 554–561.
40. Perlitz, Y.; Ein-Dor, L.; Sheinwald, D.; Slonim, N.; Shmueli-Scheuer, M. Diversity Enhanced Table-to-Text Generation via Type Control. arXiv 2022, arXiv:2205.10938.
41. Nan, L.; Flores, L.J.; Zhao, Y.; Liu, Y.; Benson, L.; Zou, W.; Radev, D. R2D2: Robust Data-to-Text with Replacement Detection. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 9–11 December 2022; pp. 6903–6917.
42. Zhu, Y.; Lu, S.; Zheng, L.; Guo, J.; Zhang, W.; Wang, J.; Yu, Y. Texygen: A Benchmarking Platform for Text Generation Models. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 11–15 March 2018; pp. 1097–1100.

Figure 1. The inputs and outputs for three table-related tasks: question answering, fact verification, and table-to-text generation.

Figure 2. The overall framework of the TabMoE. The table–text pair is input into the bottom encoder for semantic representation, then it goes through the gate module to select and activate a suitable expert, and the selected expert performs corresponding logical reasoning for the final output. All the encoders and experts are constructed by several transformer layers.

Figure 3. The impact of different numbers of experts.

Figure 4. The accuracy of different logical types on TabFact.

Figure 5. The proportion of data being correctly predicted by at least k experts.

Figure 6. The first case is drawn from LogicNLG dataset, where the gray parts represent the sentences generated by the TabMoE and the GPT-Coarse-to-Fine model. The second is taken from TabFact dataset, with the gray parts representing the statements that require verification and annotated labels. For clarity of presentation, both tables omit some content.

Table 1. The symbols used in methodology.

Symbols	Description	Symbols	Description
$T$	Input table	$x$	Input after pre-processing
$T^{*}$	Linearized table in sequence	$y$	Target output
$t_{r, c}$	Cell at row $r$ and column $c$	$z$	Latent variable for experts
$w$	Table title	$θ$	Trainable parameters
$q$	Question in question answering	$f_{L}$	Bottom encoder
$s$	Statement in fact verification	$f_{e_{i}}, f_{m l p_{e_{i}}}, d_{e_{i}}$	Expert module
$K$	Number of experts	$f_{G}, f_{m l p_{G}}$	Gate module
$d_{G}$	Probability of logical types	$H, h_{e_{i}}, h_{G}$	Hidden vectors in model

Table 2. Basic statistics of datasets.

Task	Input	Output	Dataset	#Table	#Sentence
Table-based question answering	Table + Text	Text	WikiSQL	24,241	80,654
Table-based fact verification	Table + Text	Label	TabFact	16,573	118,275
Table-to-text generation	Table	Text	LogicNLG	7392	37,015

Table 3. Denotation accuracies on WikiSQL.

Model	Val	Test
MAPO [27]	71.8	72.4
MeRL [28]	74.9	74.8
Wang et al. [29]	79.4	79.3
Min et al. [30]	84.4	83.9
Grappa [31]	85.9	84.7
BART-Large	87.5	86.2
TAPAS-Large [4]	88.0	86.4
TabMoE	89.6 ± 0.2	89.2 ± 0.4

Table 4. Accuracies on TabFact.

Model	Val	Test	Simple	Complex	Small
Human	-	-	-	-	92.1
LPA [21]	65.1	65.3	78.7	58.5	68.9
Table-BERT [21]	66.1	65.1	79.1	58.2	68.1
LogicalFactChecker [32]	71.8	71.7	85.4	65.1	74.3
HeterTFV [33]	72.5	72.3	85.9	65.7	74.2
SAT [34]	73.3	73.2	85.5	65.2	-
ProgVGAT [35]	74.9	74.4	88.3	67.6	76.2
TAPAS-Row-Col [36]	-	76.0	89.0	69.8	-
BART-Large	81.4	81.0	91.0	76.2	82.8
TAPAS-Large [37]	81.5	81.2	93.0	75.5	84.1
Decomp [38]	82.7	82.7	93.6	77.4	84.7
TabMoE	84.7 ± 0.2	84.6 ± 0.3	93.2 ± 0.1	80.4 ± 0.3	86.1 ± 0.2

Table 5. Performances on LogicNLG.

Model	Surface-Level Fidelity			Logical Fidelity
Model	BLEU-1	BLEU-2	BLEU-3	SP-Acc	NLI-Acc
GPT-TabGen [23]	49.6	28.2	14.2	44.7	74.6
GPT-Coarse-to-Fine [23]	49.0	28.3	14.6	45.3	76.4
LoFT [39]	48.1	27.7	14.9	57.7	86.9
DEVTC [40]	50.8	29.2	15.2	45.6	77.0
BART-TabGen	53.3	32.7	18.3	48.5	86.1
R2D2 [41]	51.8	32.4	18.6	50.8	85.6
TabMoE	54.4 ± 0.5	33.4 ± 0.3	19.3 ± 0.3	49.0 ± 0.4	82.4 ± 0.3

Table 6. Diversity of generated sentences on LogicNLG.

Model	Self-BLEU-1/2/3/4
GPT-TabGen	75.3/66.8/60.3/55.3
BART-TabGen	74.0/65.6/59.4/54.1
GPT-Coarse-to-Fine	73.4/65.0/58.5/52.8
TabMoE	59.0/43.5/32.8/25.4

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
