IV-Nlp: A Methodology to Understand the Behavior of DL Models and Its Application from a Causal Approach
Abstract
1. Introduction
1.1. Causal Inference in NLP
1.2. Instrumental Variable Method
1.3. Synthetic Data Generation
2. Related Work
2.1. Causal Inference
2.2. Instrumental Variable Method
2.3. Synthetic Data Generation in NLP
3. Methodology
3.1. First Approach
3.2. Second Approach
4. Application
4.1. Data Preparation
4.1.1. Original Data
4.1.2. Synthetic Data
- Input—Original texts: Six (06) data subsets were generated, comprising argumentative and non-argumentative texts for each of the training, test, and validation data sets. A small sample of each data subset was selected to ensure that the results generated by the GPT-3.5-turbo-0125 model met the research needs.
- Intervention Method: An iterative process of adjustments, modifications, and improvements to the model was carried out to achieve results acceptable with respect to the study’s objectives. These adjustments included the elaboration of specific inputs with precise examples to provide context to the model, rules to mark the end of the generated sentence, and the specific configuration of the parameters temperature, max_tokens, and stop. A variation of the standard RAG (Retrieval-Augmented Generation) technique was implemented: text documents were created with discourse markers, providing a list of authorities to introduce new information and mitigate the synthetic false texts that the model could eventually generate. This process was repeated as often as necessary until acceptable results were achieved.
- Output—texts generated by the model: The results generated by the model were saved.
- Validation of texts generated by the model: Each time errors were identified during the validation of the results produced by the model for each of the six (06) data sets, the records in the data set were relocated (if the text generated by the model was incorrect, meaning the model generated a non-argumentative text when it should have generated an argumentative text, or vice versa) and/or eliminated (if the text generated by the model was unintelligible, contained up to five tokens, or was a synthetic false text, it was eliminated to avoid manipulating the data and introducing possible biases). This method was chosen not only to use all the records in the data sets rationally and efficiently but also to calculate the expected value of the counterfactual for the intervention carried out. Finally, the argumentative and non-argumentative records correctly generated by the model for each of the six (06) data sets were totaled.
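The intervention loop above can be sketched as follows, assuming the OpenAI Python SDK. The prompt wording, parameter values, and the `build_messages`/`generate_alternative` helpers are illustrative stand-ins, not the study's actual code.

```python
def build_messages(source_text, target_label):
    """Compose the prompt: precise instructions plus the source text,
    asking for exactly one alternative of the requested class."""
    system = (
        "You rewrite Spanish electoral tweets. Produce exactly one "
        f"{target_label} alternative text and stop at the end of the sentence."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": source_text},
    ]


def generate_alternative(source_text, target_label):
    # Requires the openai package and an API key; parameter values here are
    # placeholders, not the values tuned iteratively in the study.
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=build_messages(source_text, target_label),
        temperature=0.7,   # adjusted during the iterative tuning process
        max_tokens=120,    # bounds the generated sentence
        stop=["\n"],       # rule marking the end of the generated sentence
    )
    return response.choices[0].message.content
```

Each generated text would then pass through the validation step above (relocation or elimination) before being totaled.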
4.1.3. Recovering Features from Original and Synthetic Data Sets
4.2. Deep Learning (DL) Models
4.2.1. Training DL Models with Original Data
4.2.2. Generating Predicted Data Set to Assess the Causal Effect
4.3. Causal Inference Pipeline Using the IV Method
4.3.1. Phase 1: Design and Implement the IV Method with the Original Data
- Step 1: Selecting the attribute that will fulfil the role of “Instrument”. This method provides an estimate of the causal effect of “Text” on the “Target Class” using the attribute “Date”. This step establishes the publication date’s relevance (pertinence) and exogeneity as an instrumental variable. To demonstrate that the choice of “Date” as an instrument was correct, it was verified that “Date” as an instrumental variable (Z = Date) meets the following criteria: relevance, concerning its correlation with X (X = Text), meaning Z must be related to the content of variable X; and exogeneity, meaning Z must not be correlated with variable Y (Y = Target Class), must not have a direct effect on Y, and must not be correlated with unobserved factors that influence Y.
- Step 1.a: Data preparation with temporality and topics. Topics were assigned to the texts by time period (every 02 months) based on the publication date. The LDA (Latent Dirichlet Allocation) model was implemented to identify the topics present in the texts throughout the study period (2020–2022). Given the number of records in the data set (4015 instances), 05 topics were defined for evaluation. The topics generated by the model were the following (the first six generated tokens are included):
- Topic 0: elections vote Castillo Keiko second round;
- Topic 1: constituent assembly elections president congress new constitution;
- Topic 2: elections electoral vote onpe security local;
- Topic 3: elections presidential party candidate congress peru;
- Topic 4: elections electoral jne new constitution constituent assembly.
- Step 1.b: Checking the relevance of the publication date. This section demonstrates that the feature “Date” correlates with the feature “Tweet_Checked” (the original text); that is, the influence of “Date” on the content of the original text is evaluated and verified. A correlation analysis was performed to demonstrate the relevance of “Date” in relation to the content of the topics, for which regression models were implemented with the topic proportions as dependent variables and the date (in numerical format) as the independent variable. The regression model used to analyze the impact of “Date” was OLS (Ordinary Least Squares) from the “statsmodels” API; the “seaborn” library was used to visualize the results. Figure 4 shows the results obtained for each topic. The values in Table 6 show that all p-values are less than 0.005, so date has a statistically significant influence on each of the topics; it was therefore inferred that date is a relevant (pertinent) factor for predicting textual content in terms of the generated topics. The coefficient values are small, which could be due to the scale of the date variable; however, the p-values of the F-statistics indicate that this relationship is robust and reliable. Finally, the R-squared for each topic is low, which indicates that although date is significant, it explains only a small fraction of the variability in each topic. This is common in textual analysis, where multiple factors influence the content. In conclusion, clear differences were found between the variables studied regarding the variability of each topic between 2020 and 2022, demonstrating the influence of “Date” on “Text”.
- Step 1.c: Verify that the publication date has no direct influence on the class. This step shows that there is no meaningful correlation between date and class, reinforcing the validity of “Date” as an IV. A linear regression of class against date was performed using the OLS model; Table 7 shows the results. The correlation between “date” and “class” is 0.0833 (Figure 5). The R-squared value in Table 7 indicates that only 0.7% of the variability in the class variable is explained by the date, a very low value showing that date has no significant relationship with class. The F-statistic of 28.08 has a very small p-value, which indicates that, in general terms, the model is significant, but this does not imply that date is a good variable for explaining class, because the R-squared is too low. The constant is −7.5862, which indicates that when date is 0 the class value is negative, reinforcing the idea that date has neither a positive nor a negative impact on class. Finally, the Durbin–Watson value is 1.792 (close to 2), showing no residual autocorrelation in the model. In conclusion, date explains virtually none of the variability in the class, as the R-squared result demonstrates, so date has no significant effect on class.
- Step 2: Estimating the causal effect using “date” as an instrumental variable (IV). The instrumental regression was performed in the following two (02) steps:
- Step 2.a: Regression of X (topic) on Z (date). This first stage of the regression estimates the part of X determined by Z. The predictions of this regression are stored in a data frame and constitute the “instrumented” versions of X; that is, they are free of endogeneity (Table 8).
- Step 2.b: Regression of Y (class) on instrumented X. The saved predictions (the instrumented versions of X) were used to predict Y in a second regression. This provides an unbiased estimate of the causal effect of X on Y. The results are shown in Table 9.
- Step 3: The analysis and interpretation of the results in Table 9 is developed in Section 5.1.1, Original Data.
4.3.2. Phase 2: Design and Implement the IV Method with the Synthetic Data
- Topic 0: constitution constituent assembly new elections fundamental;
- Topic 1: elections fundamental onpe voting guarantee process;
- Topic 2: elections proposals party candidates congress candidate;
- Topic 3: elections peru according to important presidential Castillo;
- Topic 4: peru elections constituent fundamental important Castillo.
- The OLS model yielded a very low Durbin–Watson value (0.019), indicating high autocorrelation in the model’s residuals. This represented a potential problem for the validity of the results, since the model could bias the inferences, so it was necessary to experiment with other, more robust models to mitigate the autocorrelation of the model’s errors.
- The Logistic Regression (LR) model reached its limit of 35 iterations without converging to stable results, so this model did not obtain the expected results either.
- Finally, a robust regression model (RLM) was implemented using the HuberT norm to manage the sensitivity to outliers, which are associated with the quality of the data generated by the GPT-3.5-turbo-0125 model (Table 13). This choice was based on the model’s ability to balance sensitivity and robustness, its resistance to outliers, its lower complexity in exploratory analyses, and its low computational resource requirements. This technique was originally proposed in 1964 and was reissued in 1992 [37].
4.3.3. Phase 3: Design and Implement the IV Method for the Predicted Data Set (Testing)
- Step 1: Calculate the proportion of topics related to each record. This step was repeated for each of the three best models. The proportion of each topic present in each record of the predicted (testing) data set was calculated. The keywords for each of the five (05) topics determined when estimating the global causal effect (Section 4.3.1) were assigned and counted. The proportions were separated by column, and the count of keywords for each topic was divided by the total number of relevant words in each record. We calculated the causal effect of each topic on the predicted data set and compared this with the results obtained in the analysis of the global causal effect.
- Step 2: Regression of X (proportion of each topic) on Z (date in numerical format). This first stage of the regression (Ordinary Least Squares (OLS)) estimates the part of X that is determined by Z. The predictions from this regression, the “instrumented” versions of X, are stored in a data frame; that is, they are free of endogeneity.
- Step 3: Regression of Y (class) on instrumented X. The saved predictions (the instrumented versions of X) were used to predict Y in a second regression (the Robust Linear Model (RLM) with the HuberT norm). This provided an unbiased estimate of the causal effect of X on Y. The results are shown in Table 14, Table 15, and Table 16 for the best, second-best, and third-best models, respectively: Table 14 for the best model (CNN-LSTM-MLP using the Cyclic Learning Rate method with automatic saving during the training process), Table 15 for the second-best model (CNN), and Table 16 for the third-best model (CNN-LSTM-MLP using the Cyclic Learning Rate method).
- Step 4: Analysis and interpretation of the results. Section 5.3.1 and Section 5.3.2 develop the analysis and interpretation of the results (Table 14 and Table 16). The analysis and interpretation of Table 15 is analogous to that of Table 14.
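The keyword-based proportion calculation of Step 1 can be sketched as follows; the `TOPIC_KEYWORDS` lists and the `topic_proportions` helper are hypothetical stand-ins for the keyword sets derived in Section 4.3.1.

```python
# Hypothetical keyword lists per topic; the real lists come from the LDA step
TOPIC_KEYWORDS = {
    0: {"elections", "vote", "castillo", "keiko"},
    1: {"constituent", "assembly", "constitution"},
}


def topic_proportions(text):
    """Count each topic's keywords in a record and divide by the total
    number of relevant (keyword) tokens, as described in Step 1."""
    tokens = text.lower().split()
    counts = {k: sum(t in kws for t in tokens)
              for k, kws in TOPIC_KEYWORDS.items()}
    total = sum(counts.values())
    return {k: (c / total if total else 0.0) for k, c in counts.items()}


props = topic_proportions("elections vote for the constituent assembly")
print(props)
```

Each record then contributes one column of proportions per topic, ready for the Step 2 regression on the numeric date.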
5. Analysis and Interpretation of Results
5.1. Causal Effect Estimation on the Global Data Set
5.1.1. Original Data
- Regarding the coefficients: The coefficients show the estimated impact of each topic (instrument) on the dependent variable (Argument_Class). A positive coefficient reflects a direct relationship, while a negative one reflects an inverse relationship. The P > |t| values indicate whether the effect of each topic is significant: Predicted_Topic_0, Predicted_Topic_1, Predicted_Topic_2, and Predicted_Topic_4 have values less than 0.05, indicating that their effects on Argument_Class are significant. However, Predicted_Topic_3 is not statistically significant (p-value = 0.130), indicating that this topic might not have a relevant effect on the dependent variable.
- Regarding the sign and magnitude of the coefficients, Predicted_Topic_0, Predicted_Topic_1, and Predicted_Topic_4 have positive coefficients, which suggests a direct association with Argument_Class: an increase in these topics predicts an increase in the probability of belonging to a specific Argument_Class. Predicted_Topic_2 has a negative and significant coefficient, indicating an inverse association with Argument_Class.
- Regarding the descriptive statistics, the Durbin–Watson statistic is 1.792 (very close to 2), which is quite positive because it shows that there is no residual autocorrelation (neither positive nor negative) in the model; that is, the null hypothesis of no autocorrelation in the disturbance u_t is not rejected. The R-squared (0.007) shows that the model explains only 0.7% of the variability of Argument_Class, suggesting that although some topics have significant relationships with Argument_Class, the total variation in Argument_Class explained by the model is low.
- Regarding the interpretation of the causal effect, these coefficients indicate that there is a significant relationship between certain topics (Predicted_Topic_0, Predicted_Topic_1, Predicted_Topic_2, and Predicted_Topic_4) and Argument_Class, to the extent that changes in the Predicted_Topics affect Argument_Class. To clarify the coefficient results for each topic regarding its causal effect on Argument_Class, 95% confidence intervals were included so that the results and their significance could be visualized graphically (Figure 7). In Figure 7, positive coefficients appear above the dotted zero line, indicating a positive (direct) relationship with Argument_Class, while negative coefficients indicate an inverse relationship. The confidence interval of Predicted_Topic_3 crosses the dotted zero line, indicating that this coefficient is not statistically significant, which coincides with its p-value of 0.130 (Table 9). To complete the analysis and interpretation of the causal effect results, individual scatter plots were produced for each of the significant topics (Predicted_Topic_0, Predicted_Topic_1, Predicted_Topic_2, and Predicted_Topic_4) against Argument_Class, with the regression line added to show the trend of each Predicted_Topic with respect to Argument_Class (Figure 8).
From Figure 8, the following interpretation can be made: the red line visualizes the average effect of each Predicted_Topic on Argument_Class and indicates the type and strength of the relationship; the slope of the line reflects the direction and magnitude of the association. The dispersion of the blue points around the red line reflects variability in the data: points close to the line imply a stronger relationship (less variability), while more dispersed points imply a weaker relationship, with other factors possibly affecting Argument_Class. The red line mainly shows the central tendency of the relationship: a positive slope indicates that as the Predicted_Topic increases, Argument_Class also increases, while a negative slope indicates the opposite. For example, Predicted_Topic_0, Predicted_Topic_1, and Predicted_Topic_4 maintain positive slopes, while Predicted_Topic_2 maintains a negative slope, consistent with the 95% confidence intervals in Figure 7, which shows that the model is robust and that there is no deviation in its results regarding the estimation of the causal effect.
5.1.2. With Synthetic Data
- Predicted_Topic_0: Increases in this topic are positively related to the probability of Argument_Class = 1. It has a positive and highly significant coefficient (p < 0.05), implying that as the value of this topic increases, the probability of Argument_Class = 1 also increases.
- Predicted_Topic_1: Reflects a negative and weakly significant relationship with Argument_Class. A negative and marginally significant coefficient (p = 0.043) suggests that this topic has a slightly negative relationship with Argument_Class = 1; that is, an increase in this predictor variable decreases the probability of Argument_Class = 1.
- Predicted_Topic_2: Has no significant effect on Argument_Class. This predictor does not contribute in a relevant way to the model’s ability to determine Argument_Class.
- Predicted_Topic_3 and Predicted_Topic_4: Both predictors reflect a positive relationship with Argument_Class. They have positive and very significant coefficients (p < 0.05). This implies that as the values of these topics increase, the probability of Argument_Class = 1 also increases.
- The intercept (const) of 0.4149 (41.49%) represents the probability of the base (Argument_Class = 1) when all predictors are 0. In this case, it is close to 50%, suggesting a relatively balanced result for Argument_Class = 1 versus Argument_Class = 0.
5.1.3. Discrepancies in the Behaviour of the Models Concerning the Estimation of the Causal Effect
- Synthetic data quality: The GPT-3.5-turbo-0125 model generated a significant degree of homogeneity in the generated words (e.g., “important”, “fundamental”, some discourse markers, and linking words), which may have reduced the LDA model’s ability to identify well-differentiated topics. This also confirms the current limitations of synthetic data in domain-specific contexts and relates to the extreme collinearity among the Predicted_Topics, which made interpretation and analysis difficult.
- Impact of collinearity: The coefficient comparison between the simple OLS regression model and the robust RLM regression model reinforces that RLM was an appropriate choice for the final model at the second regression stage, as it successfully addressed these issues.
- Visualization of complexity: The individual scatter plots for each predicted topic, including smoothing curves (Figure 10), show that the patterns between topics and classes are not linear. This non-linearity may be another source of the discrepancy observed in Predicted_Topic_3.
5.2. Performance During the Retraining of the Three Best DL Models
5.2.1. Behavior and Performance of the Best Model
5.2.2. Comparison of Performance Metrics of the Top Three Models
5.3. Comparison of Global Causal Effects with Causal Effects of Predicted Subsets
5.3.1. The Best Model
- The coefficients show the estimated impact of each topic (instrument) on the dependent variable (Argument_Class). A positive coefficient reflects a direct relationship, while a negative one reflects an inverse relationship. Predicted_Topic_0_Proportion, Predicted_Topic_1_Proportion, and Predicted_Topic_4_Proportion have values less than 0.05, indicating that their effects on Argument_Class are significant. However, Predicted_Topic_2_Proportion and Predicted_Topic_3_Proportion are not statistically significant (p-values of 0.062 and 0.152, respectively), which indicates that these topics might not have a relevant effect on the dependent variable.
- Regarding the sign and magnitude of the coefficients, Predicted_Topic_0_Proportion, Predicted_Topic_1_Proportion, and Predicted_Topic_4_Proportion have positive coefficients, suggesting a direct association with Argument_Class, while Predicted_Topic_2_Proportion and Predicted_Topic_3_Proportion have negative coefficients, suggesting an inverse association, although these two are not statistically significant.
- Regarding the interpretation of the causal effect, these coefficients indicate that there is a significant relationship between certain topics, such as Predicted_Topic_0, Predicted_Topic_1, and Predicted_Topic_4 with Argument_Class, to the extent that changes in the Predicted_Topics affect the Argument_Class.
5.3.2. The Third-Best Model
- Coefficients: Predicted_Topic_0_Proportion, Predicted_Topic_1_Proportion, Predicted_Topic_2_Proportion, and Predicted_Topic_4_Proportion have values less than 0.05, which indicates that their effects on Argument_Class are significant. However, Predicted_Topic_3_Proportion was not statistically significant (p-value of 0.113), indicating that this particular topic might not have a relevant effect on the dependent variable.
- Regarding the sign and magnitude of the coefficients, Predicted_Topic_0_Proportion, Predicted_Topic_1_Proportion, and Predicted_Topic_4_Proportion have positive coefficients, suggesting a direct relationship with Argument_Class, while Predicted_Topic_2_Proportion has a negative coefficient, reflecting an inverse relationship. That is, changes in these topics predict corresponding changes in the probability of a record belonging to a specific Argument_Class.
- Regarding the interpretation of the causal effect, these coefficients indicate a significant relationship between Predicted_Topic_0_Proportion, Predicted_Topic_1_Proportion, Predicted_Topic_2_Proportion, and Predicted_Topic_4_Proportion and Argument_Class, to the extent that changes in these proportions affect Argument_Class.
6. Discussion
6.1. From the Perspective of Causal Inference and Estimation
6.2. From the Perspective of the Characteristics of the Data Set
6.3. From the Perspective of Synthetic Data Generation and Causal Inference Techniques
7. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Recovering Features from Original and Synthetic Data Sets
- Step 1: Recovering the ‘Date’ and ‘Id’ features for the original data sets (Figure A1). The data set, with the attributes Id, Text, and Class_Argument, was converted from ‘utf-8’ to ‘latin-1’ encoding. The attribute ‘Date’ was retrieved from the annotated data set using the Pandas merge function, which automatically recovered ‘Date’ for 2895 records. The ‘Date’ and ‘Id’ of the remaining records were assigned manually. Manual verification was necessary because the IDs had to be compared with their respective texts, so the annotators invested significant time to guarantee the quality of the data.
- Step 2: Feature recovery on the six (06) original data sets distributed from the global corpus (Figure A2). An automated procedure was implemented using the Pandas (version 2.1.4) library with the Keras (version 3.4.1) framework to filter the global data set generated in Step 1. Duplicate records were eliminated, and each feature recovery process was then verified for each of the six (06) data sets. Only one record, from the set of argumentative texts of the training set, could not be matched because it was not located in the global data set of Step 1; it was eliminated, leaving the training set of argumentative texts with 1526 records with their attributes duly recovered. No incidents were found in the rest of the data sets, and the attributes of the six (06) data sets were successfully recovered (see Figure A3).
- Step 3: Feature recovery on the six (06) data sets with synthetic texts generated by the GPT-3.5-turbo-0125 model. This step was carried out in two (02) phases:
- Phase 1: Assignment of ‘Date’ and ‘Id’ to the six (06) sets of synthetic texts generated by the GPT-3.5-turbo-0125 model (the OUTPUT of the intervention method defined in Figure 2). Figure A3 graphically shows the procedure carried out in this phase. The process consisted of adding the columns ‘Date’ and ‘Id’ according to the index position of the Excel file of the data set, after ensuring that both data sets had the same number of records. This was possible because code routines were implemented so that the GPT-3.5-turbo-0125 model would generate one alternative text for each text in the data set (see Figure 2). During the assignment of ‘Date’ and ‘Id’ to the data sets with generated texts, only one incident was found: two records were eliminated from the set of argumentative texts in the training set because the model had generated two different texts for the same source record. In the remaining five (05) sets of generated texts there were no incidents, because the model generated, in all cases, exactly one alternative text for each record of the original sets. Table 4 shows the results obtained. The data sets of generated texts had the following structure: Date, Original_Id, Tweet_Checked, Clase_Argumento (where Original_Id is the Id value of the data set with original texts, which is helpful for data traceability during the interpretation stage).
- Phase 2: Assignment of ‘Date’ and ‘Id’ to the six (06) sets of synthetic texts generated by the GPT-3.5-turbo-0125 model (validation of texts generated by the model, as defined in the intervention method described in Figure 2). Figure A3 graphically shows the procedure carried out in this phase. Records whose generated texts matched the texts that passed data validation at the end of the intervention method were filtered. This process was repeated for each pair of data sets (training, testing, and validation). For the training set, the attributes ‘Date’ and ‘Id’ were automatically recovered for 1500 and 1537 records in the sets of argumentative and non-argumentative texts, respectively. A manual verification of these sets was then performed to complete ‘Date’ and ‘Id’ in the 25 and 27 missing records, respectively, assigning these two attributes in line with the intervention method detailed in Figure 2. The procedure performed for the training set was replicated for the test and validation sets, likewise without incident.
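The automated part of the ‘Date’ recovery (Step 1 and Phase 2) rests on a key-based merge. A minimal sketch with pandas follows, using illustrative column values; the `indicator` flag separates automatically recovered records from those needing manual completion.

```python
import pandas as pd

# Annotated set that still holds the publication date (illustrative values)
annotated = pd.DataFrame({
    "Id": [1, 2, 3],
    "Tweet_Checked": ["text a", "text b", "text c"],
    "Date": ["2021-04-11", "2021-04-12", "2021-06-06"],
})

# Working set that lost the 'Date' attribute
working = pd.DataFrame({
    "Id": [1, 3],
    "Tweet_Checked": ["text a", "text c"],
    "Clase_Argumento": [1, 0],
})

# Left merge on Id recovers 'Date'; _merge flags records needing manual review
recovered = working.merge(annotated[["Id", "Date"]],
                          on="Id", how="left", indicator=True)
missing = recovered[recovered["_merge"] != "both"]  # to be completed by hand
print(recovered)
```

Records left in `missing` would be the ones whose ‘Date’ and ‘Id’ the annotators assigned manually by comparing IDs with their texts.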
References
- Ludwig, J.; Mullainathan, S. Machine Learning as a Tool for Hypothesis Generation. Q. J. Econ. 2024, 139, 751–827. [Google Scholar] [CrossRef]
- Scholkopf, B.; Locatello, F.; Bauer, S.; Ke, N.R.; Kalchbrenner, N.; Goyal, A.; Bengio, Y. Toward Causal Representation Learning. Proc. IEEE 2021, 109, 612–634. [Google Scholar] [CrossRef]
- Spirtes, P. Introduction to causal inference. J. Mach. Learn. Res. 2010, 11, 1643–1662. [Google Scholar]
- Yang, J.; Han, S.C.; Poon, J. A survey on extraction of causal relations from natural language text. Knowl. Inf. Syst. 2022, 64, 1161–1186. [Google Scholar] [CrossRef]
- Feder, A.; Oved, N.; Shalit, U.; Reichart, R. CausaLM: Causal Model Explanation Through Counterfactual Language Models. Comput. Linguist. 2021, 47, 333–386. [Google Scholar] [CrossRef]
- Jiao, L.; Wang, Y.; Liu, X.; Li, L.; Liu, F.; Ma, W.; Guo, Y.; Chen, P.; Yang, S.; Hou, B. Causal Inference Meets Deep Learning: A Comprehensive Survey. Research 2024, 7, 467. [Google Scholar] [CrossRef]
- Chen, Y.; Bühlmann, P. Domain adaptation under structural causal models. J. Mach. Learn. Res. 2021, 22, 1–80. [Google Scholar]
- Bound, J.; Jaeger, D.A.; Baker, R.M. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. J. Am. Stat. Assoc. 1995, 90, 443–450. [Google Scholar] [CrossRef]
- Angrist, J.; Imbens, G. Identification and Estimation of Local Average Treatment Effects. 1995. Available online: https://www.nber.org/papers/t0118 (accessed on 10 March 2025).
- Molak, A.; Jaokar, A. Causal Inference and Discovery in Python: Unlock the Secrets of Modern Causal Machine Learning with DoWhy, EconML, PyTorch and More; Packt Publishing Ltd.: Birmingham, UK, 2023; p. 429. [Google Scholar]
- Lu, Y.; Shen, M.; Wang, H.; Wang, X.; van Rechem, C.; Fu, T.; Wei, W. Machine Learning for Synthetic Data Generation: A Review. arXiv 2024, arXiv:2302.04062. [Google Scholar]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
- Hicks, M.T.; Humphries, J.; Slater, J. ChatGPT is bullshit. Ethics Inf. Technol. 2024, 26, 38. [Google Scholar] [CrossRef]
Sub Data Set | Argument (A) | Non-Argument (NA) | Total |
---|---|---|---|
Training data set | 1527 | 1604 | 3131 |
Testing data set | 333 | 351 | 684 |
Validation data set | 98 | 102 | 200 |
Texts | Training Set A | Training Set NA | Testing Set A | Testing Set NA | Validation Set A | Validation Set NA |
---|---|---|---|---|---|---|
Original texts | 1527 | 1604 | 333 | 351 | 98 | 102 |
Generated texts | 1529 | 1604 | 333 | 351 | 98 | 102 |
Texts successfully generated | 1503 | 1537 | 313 | 335 | 91 | 98 |
Texts | Training Set A | Training Set NA | Testing Set A | Testing Set NA | Validation Set A | Validation Set NA |
---|---|---|---|---|---|---|
Original texts | 1527 | 1604 | 333 | 351 | 98 | 102 |
Total texts correctly generated | 1530 | 1562 | 327 | 351 | 95 | 104 |
Total | 3057 | 3166 | 660 | 702 | 193 | 206 |
Data Sets | Training Set A | Training Set NA | Testing Set A | Testing Set NA | Validation Set A | Validation Set NA |
---|---|---|---|---|---|---|
Records with retrieved features—Automatically | 1500 | 1537 | 313 | 335 | 91 | 98 |
Records with retrieved features—Manually | 27 | 25 | 14 | 16 | 4 | 6 |
Total records with features recovered | 1527 | 1562 | 327 | 351 | 95 | 104 |
Classifier | F1: NA (Class 0) | F1: A (Class 1) | Test Accuracy (Decimal) | Test Accuracy (%) |
---|---|---|---|---|
LSTM | 0.84 | 0.83 | 0.84 | 83.63% |
CuDNNLSTM | 0.76 | 0.84 | 0.781 | 84.36% |
Bidirectional CuDNNLSTM | 0.85 | 0.85 | 0.85 | 84.65% |
CNN | 0.86 | 0.86 | 0.86 | 85.96% |
CNN-LSTM | 0.86 | 0.86 | 0.86 | 85.82% |
CNN-LSTM-MLP | 0.85 | 0.86 | 0.86 | 85.53% |
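The per-class F1 scores and test accuracies reported above can be recomputed directly from label predictions. The following is a minimal pure-Python sketch with toy labels (0 = non-argumentative, 1 = argumentative); the values are illustrative, not the study's predictions:

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels: 0 = non-argumentative (NA), 1 = argumentative (A)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
print(round(f1_per_class(y_true, y_pred, 0), 2))  # → 0.8
print(round(f1_per_class(y_true, y_pred, 1), 2))  # → 0.86
print(round(accuracy(y_true, y_pred), 2))         # → 0.83
```

Reporting F1 per class alongside overall accuracy, as the table does, guards against a classifier that favors the majority class.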
Id_Topic | Topic Name | R-Squared | Prob (F-Statistic) | Level of Significance |
---|---|---|---|---|
0 | elections vote Castillo Keiko second round | 0.002 | 0.00334 | Significant |
1 | constituent assembly elections president congress new constitution | 0.083 | | Very significant |
2 | elections electoral vote onpe security local | 0.046 | | Very significant |
3 | elections presidential party candidate congress peru | 0.014 | | Very significant |
4 | elections electoral jne new constitution constituent assembly | 0.003 | 0.00118 | Significant |
Constant | F-Statistic | R-Squared | Prob (F-Statistic) | Durbin-Watson |
---|---|---|---|---|
−7.5862 | 28.02 | 0.007 | | 1.792 |
Id_Topic | F-Statistic | R-Square | Prob (F-Statistic) (p-Value) | Durbin–Watson |
---|---|---|---|---|
Topic 0 | 8.624 | 0.002 | | 1.441 |
Topic 1 | 363.9 | 0.083 | | 1.783 |
Topic 2 | 193.6 | 0.046 | | 1.684 |
Topic 3 | 56.48 | 0.014 | | 1.673 |
Topic 4 | 10.54 | 0.003 | | 1.873 |
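The per-topic diagnostics above (R-squared, Durbin–Watson) are standard outputs of an OLS fit. As a sketch of how the two statistics are computed, here is a one-regressor OLS in pure Python with illustrative data, not the study's topic series:

```python
def ols_fit(x, y):
    """Closed-form slope and intercept for a one-regressor OLS fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - slope * mx, slope  # (intercept, slope)

def r_squared(y, y_hat):
    """Share of the variance in y explained by the fitted values."""
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def durbin_watson(resid):
    """Durbin-Watson statistic: values near 2 suggest no first-order
    autocorrelation in the residuals; values near 0, strong positive
    autocorrelation."""
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    return num / sum(e ** 2 for e in resid)

# Illustrative series only
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 0.9, 2.2, 2.8, 4.1]
a, b = ols_fit(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```

Note that a small R-squared (e.g. 0.002 for Topic 0) can still yield a highly significant F-statistic when the sample is large, which is why the tables report both.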
Predicted_Topic | Coef | std-Error | t | P > |t| | 0.025 | 0.975 |
---|---|---|---|---|---|---|
Intercept | 0.4227 | 0.007 | 57.868 | 0.000 | 0.408 | 0.437 |
Predicted_Topic_0 | 0.1490 | 0.010 | 14.647 | 0.000 | 0.129 | 0.169 |
Predicted_Topic_1 | 0.3362 | 0.049 | 6.881 | 0.000 | 0.240 | 0.432 |
Predicted_Topic_2 | −0.1478 | 0.042 | −3.544 | 0.000 | −0.229 | −0.066 |
Predicted_Topic_3 | −0.0358 | 0.024 | −1.516 | 0.130 | −0.082 | 0.010 |
Predicted_Topic_4 | 0.1210 | 0.010 | 12.509 | 0.000 | 0.102 | 0.140 |
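The t, P > |t|, and [0.025, 0.975] columns in tables like the one above follow directly from each coefficient and its standard error. A sketch using the Predicted_Topic_0 row; the reported t of 14.647 uses the unrounded inputs and the exact quantile, so this normal-approximation check lands close but not identical:

```python
def t_stat(coef, std_err):
    """t (or z) statistic: coefficient divided by its standard error."""
    return coef / std_err

def ci_95(coef, std_err, q=1.96):
    """Normal-approximation 95% confidence interval. Regression tables
    use the exact t/z quantile and unrounded estimates, so reported
    values differ slightly from this sketch."""
    return coef - q * std_err, coef + q * std_err

# Predicted_Topic_0 row: coef 0.1490, std error 0.010
t = t_stat(0.1490, 0.010)          # ≈ 14.9 (table reports 14.647)
low, high = ci_95(0.1490, 0.010)   # ≈ (0.1294, 0.1686) vs reported [0.129, 0.169]
```

A coefficient whose interval excludes zero (like Predicted_Topic_0) is significant at the 5% level, while one that straddles zero (Predicted_Topic_3, p = 0.130) is not.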
Date_Numeric_log | F-Statistic | R-Squared | Prob (F-Statistic) | Durbin-Watson |
---|---|---|---|---|
8.0995 | 27.65 | 0.007 | | 0.019 |
Variable | Coef | std-Error | z | P > |z| * | 0.025 | 0.975 |
---|---|---|---|---|---|---|
const | −692.1805 | 133.629 | −5.180 | 0.000 | −954.089 | −430.272 |
Fecha_Numerico_log | 32.6413 | 6.302 | 5.180 | 0.000 | 20.290 | 44.993 |
Id_Topic | F-Statistic | R-Square | Prob (F-Statistic) * | Durbin–Watson |
---|---|---|---|---|
Topic 0 | 473.7 | 0.107 | | 1.992 |
Topic 1 | 331.8 | 0.077 | | 1.904 |
Topic 2 | 43.45 | 0.011 | | 1.976 |
Topic 3 | 0.7598 | 0 | 0.383 | 1.951 |
Topic 4 | 66.99 | 0.017 | | 2.036 |
Predicted_Topic | Coef | std-Error | z | P > |z|* | 0.025 | 0.975 |
---|---|---|---|---|---|---|
const | 0.4149 | 0.007 | 59.735 | 0.000 | 0.401 | 0.428 |
Predicted_Topic_0 | 0.3043 | 0.037 | 8.223 | 0.000 | 0.232 | 0.377 |
Predicted_Topic_1 | −0.0683 | 0.034 | −2.026 | 0.043 | −0.134 | −0.002 |
Predicted_Topic_2 | 0.0096 | 0.009 | 1.045 | 0.296 | −0.008 | 0.028 |
Predicted_Topic_3 | 0.0800 | 0.002 | 48.112 | 0.000 | 0.077 | 0.083 |
Predicted_Topic_4 | 0.0892 | 0.009 | 9.561 | 0.000 | 0.071 | 0.107 |
Predicted_Topic | Coef | std-Error | z | P > |z|* | 0.025 | 0.975 |
---|---|---|---|---|---|---|
const | 0.4155 | 0.019 | 21.502 | 0.000 | 0.378 | 0.453 |
Predicted_Topic_0_Proportion | 0.0829 | 0.008 | 9.867 | 0.000 | 0.066 | 0.099 |
Predicted_Topic_1_Proportion | 0.3311 | 0.093 | 3.570 | 0.000 | 0.149 | 0.513 |
Predicted_Topic_2_Proportion | −0.1328 | 0.071 | −1.869 | 0.062 | −0.272 | 0.006 |
Predicted_Topic_3_Proportion | −0.0889 | 0.062 | −1.434 | 0.152 | −0.210 | 0.033 |
Predicted_Topic_4_Proportion | 0.3251 | 0.082 | 3.961 | 0.000 | 0.164 | 0.486 |
Predicted_Topic | Coef | std-Error | z | P > |z|* | 0.025 | 0.975 |
---|---|---|---|---|---|---|
const | 0.3940 | 0.019 | 20.382 | 0.000 | 0.356 | 0.432 |
Predicted_Topic_0_Proportion | 0.0786 | 0.008 | 9.363 | 0.000 | 0.062 | 0.095 |
Predicted_Topic_1_Proportion | 0.3147 | 0.093 | 3.393 | 0.001 | 0.133 | 0.496 |
Predicted_Topic_2_Proportion | −0.1265 | 0.071 | −1.780 | 0.075 | −0.266 | 0.013 |
Predicted_Topic_3_Proportion | −0.0847 | 0.062 | −1.367 | 0.172 | −0.206 | 0.037 |
Predicted_Topic_4_Proportion | 0.3089 | 0.082 | 3.763 | 0.000 | 0.148 | 0.470 |
Predicted_Topic | Coef | std-Error | z | P > |z|* | 0.025 | 0.975 |
---|---|---|---|---|---|---|
const | 0.4153 | 0.019 | 21.505 | 0.000 | 0.377 | 0.453 |
Predicted_Topic_0_Proportion | 0.0843 | 0.008 | 10.046 | 0.000 | 0.068 | 0.101 |
Predicted_Topic_1_Proportion | 0.3457 | 0.093 | 3.731 | 0.000 | 0.164 | 0.527 |
Predicted_Topic_2_Proportion | −0.1437 | 0.071 | −2.024 | 0.043 | −0.283 | −0.005 |
Predicted_Topic_3_Proportion | −0.0982 | 0.062 | −1.587 | 0.113 | −0.220 | 0.023 |
Predicted_Topic_4_Proportion | 0.3382 | 0.082 | 4.123 | 0.000 | 0.177 | 0.499 |
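The staged regressions above follow the instrumental-variable logic of the methodology. In its simplest single-instrument form, the IV (Wald) estimator is cov(Z, Y) / cov(Z, X), which coincides with two-stage least squares when there is one instrument and one endogenous regressor. A toy sketch with hypothetical data, not the paper's topic proportions:

```python
def cov(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

def iv_slope(z, x, y):
    """Single-instrument IV (Wald) estimate of the effect of X on Y:
    cov(Z, Y) / cov(Z, X). With one instrument and one endogenous
    regressor this equals the two-stage least squares estimate."""
    return cov(z, y) / cov(z, x)

# Toy data: first stage X = 2Z, structural equation Y = 3X,
# so the IV estimate recovers the structural slope exactly.
z = [0.0, 1.0, 2.0, 3.0]
x = [2.0 * zi for zi in z]
y = [3.0 * xi for xi in x]
print(iv_slope(z, x, y))  # → 3.0
```

The estimator only uses the variation in X that is driven by the instrument Z, which is what lets it isolate the causal effect when X is confounded.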
Predicted_Topic | Coefficients | Prob (F-Statistic) * | Significance Level |
---|---|---|---|
Predicted_Topic_0 | 0.3854 | | Very significant |
Predicted_Topic_1 | −0.4071 | | Very significant |
Predicted_Topic_3 | −9.4341 | | Very significant |
Predicted_Topic_4 | 1.5446 | | Very significant |
Model | Selected Epoch | Val Accuracy (%) | Val Loss |
---|---|---|---|
CNN | 9 | 86 | 0.3047 |
CNN-LSTM | 10 | 86.50 | 0.4341 |
CNN-LSTM-MLP | 10 | 88.50 | 0.3534 |
Classifier | F1: NA (Class 0) | F1: A (Class 1) | Test Accuracy (Decimal) | Test Accuracy (%) |
---|---|---|---|---|
CNN | 0.86 | 0.85 | 0.855 | 85.53% |
CNN-LSTM | 0.85 | 0.84 | 0.845 | 84.50% |
CNN-LSTM-MLP | 0.86 | 0.86 | 0.858 | 85.82% |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Guzman-Monteza, Y.; Fernandez-Luna, J.M.; Ribadas-Pena, F.J. IV-Nlp: A Methodology to Understand the Behavior of DL Models and Its Application from a Causal Approach. Electronics 2025, 14, 1676. https://doi.org/10.3390/electronics14081676