Reason2Decide-C: Adaptive Cycle-Consistent Training for Clinical Rationales
Abstract
1. Introduction
- We frame R2D-C as a method for improving prediction–rationale alignment in clinical NLP, rather than primarily for maximizing predictive accuracy, and show that this can be achieved without sacrificing predictive performance.
- We augment R2D’s stage 2 with dynamic confidence-adaptive scheduled sampling, which decides when explanations should be conditioned on predicted versus gold labels, thereby reducing exposure bias while preventing conditioning on low-confidence predictions.
- We introduce a cycle-consistency loss that explicitly encourages rationales to be sufficient for reconstructing the labels they condition on.
- We integrate these into a unified framework and evaluate on triage, biomedical QA, diagnosis prediction, and multi-step medical reasoning tasks, demonstrating large gains in prediction–rationale alignment (for example, consistency on Clinical Triage with T5-Small rises from 39.64 to 81.30) and higher LLM-judged rationale quality without sacrificing accuracy.

2. Related Work
2.1. Explainable NLP and Faithful Self-Rationalization
2.2. Multi-Task Learning and Knowledge Distillation
2.3. Rationale-Driven Clinical NLP
2.4. Exposure Bias and Scheduled Sampling
3. Materials and Methods
3.1. Methodology
3.1.1. Stage 1: Rationale Foundation Training
3.1.2. Stage 2: Joint Optimization of Prediction, Explanation, and Cycle Consistency
- Confidence-adaptive scheduled sampling that selectively conditions explanations on predicted labels based on per-example confidence.
- A differentiable cycle-consistency objective that encourages rationales to be sufficient for reconstructing the conditioning label.
- Prediction Task
- Explanation with Confidence-Adaptive Scheduled Sampling
- Cycle Consistency: Rationale-to-Label Mapping
- Combined Objective and Training Schedule
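
The confidence-adaptive conditioning decision named in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the confidence measure (maximum class probability), the threshold `tau`, and the sampling probability `p_pred` are hypothetical, since this section does not give the exact rule.

```python
import numpy as np

def choose_conditioning_labels(probs, gold, tau=0.8, p_pred=0.5, rng=None):
    """Pick, per example, the label that the explanation is conditioned on.

    probs:  (batch, num_classes) predicted label probabilities
    gold:   (batch,) gold label ids
    tau:    hypothetical confidence threshold (assumption)
    p_pred: hypothetical probability of using a confident prediction (assumption)
    """
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    gold = np.asarray(gold)

    pred = probs.argmax(axis=1)          # predicted label per example
    conf = probs.max(axis=1)             # confidence = max class probability
    confident = conf >= tau              # never condition on low-confidence predictions
    sampled = rng.random(len(gold)) < p_pred  # scheduled-sampling coin flip
    use_pred = confident & sampled

    # Fall back to the gold label whenever the prediction is not used.
    labels = np.where(use_pred, pred, gold)
    return labels, use_pred
```

With `p_pred=1.0`, every confident prediction replaces the gold label, while low-confidence examples still fall back to gold; with `p_pred=0.0` the function reduces to plain teacher forcing on gold labels.
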

Algorithm 1: One stage 2 training step in R2D-C.

During warm-up, we:
- Increase the prediction weight linearly from 0 to 0.7. This initially prioritizes the explanation loss over the prediction loss, providing a smoother transition from single-task to multi-task optimization [8]. After warm-up, the prediction weight remains constant at 0.7. This value follows [8] and balances both tasks while slightly emphasizing prediction, since stage 1 has already established rationale generation capabilities.
- Disable the cycle loss by setting its weight to 0. This allows the model to stabilize on the prediction and explanation tasks before the stricter rationale-to-label consistency constraint is enforced.
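
The warm-up behaviour described above can be sketched as a small schedule function. This is a hedged illustration: `warmup_steps` is a hypothetical parameter (the warm-up length is not given in this section), the post-warm-up cycle weight of 0.1 is inferred from the sensitivity table row that matches the main R2D-C results, and switching the cycle loss on in a single step after warm-up is an assumption.

```python
def stage2_weights(step, warmup_steps, pred_max=0.7, cycle_weight=0.1):
    """Return (prediction_weight, cycle_weight) for a stage 2 training step.

    step:         current optimizer step, starting at 0
    warmup_steps: hypothetical warm-up length (assumption)
    pred_max:     prediction weight reached at the end of warm-up (0.7 in the text)
    cycle_weight: weight used once warm-up is over (0.1 is inferred, see lead-in)
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        # After warm-up: constant prediction weight, cycle loss enabled.
        return pred_max, cycle_weight
    # During warm-up: linear ramp for prediction, cycle loss disabled.
    return pred_max * step / warmup_steps, 0.0


# Example: a hypothetical 100-step warm-up.
for s in (0, 50, 100):
    print(s, stage2_weights(s, warmup_steps=100))
```
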
3.1.3. Inference
3.2. Tasks and Datasets
3.2.1. Clinical Triage Dataset
3.2.2. PubMedQA
3.2.3. DDXPlus
3.2.4. MedReason
3.3. Implementation Details
3.4. Baselines
3.5. Evaluation
- Consistency (P1 = P2): This metric is motivated by [34] and measures the agreement between the label predicted from the original input (P1) and the label predicted from the generated rationale alone (P2). High consistency indicates that the rationale preserves the model’s decision logic, serving as a proxy for prediction-rationale alignment.
- Sufficiency (P2 = GT): Sufficiency evaluates whether the generated rationale contains enough information to recover the gold label (GT) without access to the original input. Given only the generated rationale, the model is asked to predict the label P2. This metric is motivated by the ERASER benchmark protocol for evaluating rationalized predictions [35]. It measures label recoverability from the generated rationale but does not by itself establish that the rationale reflects the model’s full internal causal reasoning.
- Gold Sufficiency (GR → GT): This metric measures whether providing the gold rationale (GR) to the model causes the model to output the gold label (GT). If GR → GT is low, the model struggles to recover the label even from a perfect rationale, indicating a model-side limitation rather than a failure of the generated rationale.
- LLM-as-a-Judge Correctness: Recent work has demonstrated strong alignment between LLM and human evaluation [36,37]. With this motivation, we employed Qwen-3-8B and Qwen-3-32B [24] models for expert-style assessment on a random sample of 2000 examples from the Clinical Triage test set, without stratification by class, prediction correctness, or model confidence. Because this subset was not stratified, the resulting judge scores should be interpreted as approximate subset-level evidence rather than a controlled robustness analysis across label, error, or confidence strata. We used a 5-point scale to rate clinical alignment between the rationales and the predictions. The motivating question is: “If the predicted disposition is Go to L&D now, does the generated rationale justify that decision, rather than supporting a different disposition (e.g., homecare)?” To ensure consistent evaluation, we generated standardized disposition definitions using prompt-based refinement with a Qwen-3-32B model.
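
The three agreement metrics defined above reduce to simple label comparisons. The sketch below shows one way to compute them as percentages, assuming the labels have already been collected into equal-length lists; the function and argument names are illustrative and not taken from the paper.

```python
def _agreement(a, b):
    """Percentage of positions where two equal-length label lists agree."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def alignment_metrics(p1, p2, gt):
    """p1: labels predicted from the original input.
    p2: labels predicted from the generated rationale alone.
    gt: gold labels.
    """
    return {
        "consistency": _agreement(p1, p2),   # P1 = P2
        "sufficiency": _agreement(p2, gt),   # P2 = GT
    }


def gold_sufficiency(p_from_gold_rationale, gt):
    """Labels predicted from the gold rationale, compared with gold labels (GR -> GT)."""
    return _agreement(p_from_gold_rationale, gt)


# Toy example with four test instances.
print(alignment_metrics(p1=[0, 1, 2, 1], p2=[0, 1, 2, 0], gt=[0, 2, 1, 1]))
```

Reading the metrics side by side matters: a model can be highly consistent while still having low sufficiency if its predictions from the input are themselves wrong.
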
4. Results and Discussion
4.1. Predictive Performance and Rationale Alignment
4.2. LLM-as-a-Judge Evaluation
5. Ablation Studies
5.1. Ablation of Core Components
5.2. Ablation on Soft Rationale Construction
5.3. Sensitivity to the Cycle-Consistency Weight
6. Limitations
7. Ethics Statement
8. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Prompts Used
| You are building a clinical definition for the disposition: [DISPOSITION_NAME]. Current working definition: “““[CURRENT_DEFINITION]””” You are given new examples of triage notes and rationales for this disposition. Use them to refine, expand, or correct the working definition. Keep the definition concise but clinically accurate. Triage Notes and Rationales: [EXAMPLES] Update the definition: - Incorporate any new key symptoms, criteria, or thresholds. - Remove incorrect parts. - Keep it as clear and specific as possible. Output only the revised definition text, nothing else. |
| Issue: [ISSUE_ASSESSMENT] Dispositions: [CLASSES_TEXT] Classify the healthcare issue into one of the dispositions above. Return your answer in the following **strict** format: Class: [chosen digit] Do not ask for more information, and do not provide any general statements. Only respond with the digit. |
| You are an expert clinical trainer for telephone triage nursing. [DEFINITIONS] Task: Given the following rationale and disposition, score the alignment on a scale of 1 to 5, where: 5—Excellent Alignment 4—Good Alignment 3—Moderate Alignment 2—Poor Alignment 1—Very Poor Alignment Rationale: [RATIONALE] Disposition: [DISPOSITION] Output exactly one number (1, 2, 3, 4, or 5) with no other text. |
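
Two of the prompts above request strictly formatted replies (`Class: [chosen digit]` and a single score from 1 to 5). A minimal parsing sketch is given below; the function names and the choice to return `None` for malformed replies are assumptions, not something this appendix specifies.

```python
import re

def parse_class(reply):
    """Extract the digit from a reply of the form 'Class: 3'; None if absent."""
    match = re.search(r"Class:\s*(\d)", reply)
    return int(match.group(1)) if match else None


def parse_judge_score(reply):
    """Accept only a bare score from 1 to 5 (surrounding whitespace allowed)."""
    match = re.fullmatch(r"\s*([1-5])\s*", reply)
    return int(match.group(1)) if match else None


print(parse_class("Class: 3"))       # 3
print(parse_judge_score(" 4\n"))     # 4
print(parse_judge_score("4 out of 5"))  # None, because extra text is rejected
```
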
Appendix B. Clinical Disposition Definitions
References
- Gurrapu, S.; Kulkarni, A.; Huang, L.; Lourentzou, I.; Batarseh, F.A. Rationalization for explainable NLP: A survey. Front. Artif. Intell. 2023, 6, 1225093.
- Cross, J.L.; Choma, M.A.; Onofrey, J.A. Bias in medical AI: Implications for clinical decision-making. PLoS Digit. Health 2024, 3, e0000651.
- Lyu, Q.; Apidianaki, M.; Callison-Burch, C. Towards Faithful Model Explanation in NLP: A Survey. arXiv 2024.
- Yang, J.; Glockner, M.; Rocha, A.; Gurevych, I. Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks. arXiv 2025.
- Bhan, M.; Vittaut, J.N.; Chesneau, N.; Chandar, S.; Lesot, M.J. NeuroFaith: Evaluating LLM Self-Explanation Faithfulness via Internal Representation Alignment. arXiv 2026.
- Madsen, A.; Chandar, S.; Reddy, S. Are self-explanations from Large Language Models faithful? In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2024; pp. 295–337.
- Hsieh, C.Y.; Li, C.L.; Yeh, C.K.; Nakhost, H.; Fujii, Y.; Ratner, A.; Krishna, R.; Lee, C.Y.; Pfister, T. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. arXiv 2023.
- Hasan, H.M.Q.; Bashier, H.K.; Dai, J.; Kim, M.Y.; Goebel, R. Reason2Decide: Rationale-Driven Multi-Task Learning. arXiv 2025.
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023.
- Bengio, S.; Vinyals, O.; Jaitly, N.; Shazeer, N. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS’15), Montreal, QC, Canada, 7–12 December 2015; MIT Press: Cambridge, MA, USA, 2015; Volume 1, pp. 1171–1179.
- Schmidt, F. Generalization in Generation: A closer look at Exposure Bias. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, China, 4 November 2019; Birch, A., Finch, A., Hayashi, H., Konstas, I., Luong, T., Neubig, G., Oda, Y., Sudoh, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 157–167.
- Jang, E.; Gu, S.; Poole, B. Categorical Reparameterization with Gumbel-Softmax. arXiv 2017.
- Kunz, J.; Kuhlmann, M. Properties and Challenges of LLM-Generated Explanations. In Proceedings of the Third Workshop on Bridging Human–Computer Interaction and Natural Language Processing, Mexico City, Mexico, 21 June 2024; Blodgett, S.L., Cercas Curry, A., Dev, S., Madaio, M., Nenkova, A., Yang, D., Xiao, Z., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 13–27.
- Atakishiyev, S.; Babiker, H.K.B.; Dai, J.; Farruque, N.; Hayashi, T.; Hriti, N.S.; Rahman, M.A.; Smith, I.; Kim, M.Y.; Zaïane, O.R.; et al. Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations. arXiv 2025.
- Caruana, R. Multitask Learning. Mach. Learn. 1997, 28, 41–75.
- Ruder, S. An Overview of Multi-Task Learning in Deep Neural Networks. arXiv 2017.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015.
- Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551.
- Liu, Y.; Meng, F.; Chen, Y.; Xu, J.; Zhou, J. Confidence-Aware Scheduled Sampling for Neural Machine Translation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2327–2337.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
- Lewis, P.; Ott, M.; Du, J.; Stoyanov, V. Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, Online, 19 November 2020; Rumshisky, A., Roberts, K., Bethard, S., Naumann, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 146–157.
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2019, 36, 1234–1240.
- Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 Technical Report. arXiv 2025.
- Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W.; Lu, X. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2567–2577.
- Tchango, A.F.; Goel, R.; Wen, Z.; Martel, J.; Ghosn, J. DDXPlus: A New Dataset For Automatic Medical Diagnosis. arXiv 2022.
- Wu, J.; Deng, W.; Li, X.; Liu, S.; Mi, T.; Peng, Y.; Xu, Z.; Liu, Y.; Cho, H.; Choi, C.I.; et al. MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs. arXiv 2025.
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019.
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv 2020.
- Labrak, Y.; Bazoge, A.; Morin, E.; Gourraud, P.A.; Rouvier, M.; Dufour, R. BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains. arXiv 2024.
- Pal, A.; Sankarasubbu, M. OpenBioLLMs: Advancing Open-Source Large Language Models for Healthcare and Life Sciences. 2024. Available online: https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B (accessed on 27 February 2026).
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2020.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; Isabelle, P., Charniak, E., Lin, D., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 311–318.
- Dasgupta, S.; Frost, N.; Moshkovitz, M. Framework for Evaluating Faithfulness of Local Explanations. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 7–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; Proceedings of Machine Learning Research, PMLR: Cambridge, MA, USA, 2022; Volume 162, pp. 4794–4815.
- DeYoung, J.; Jain, S.; Rajani, N.F.; Lehman, E.; Xiong, C.; Socher, R.; Wallace, B.C. ERASER: A Benchmark to Evaluate Rationalized NLP Models. arXiv 2020.
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv 2023.
- Niu, S.; Ma, J.; Lin, H.; Bai, L.; Wang, Z.; Xu, Y.; Song, Y.; Yang, X. Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 11011–11024.
| Model | Per-GPU Batch | Grad Accum |
|---|---|---|
| T5-Small | 16 | 1 |
| T5-Base | 4 | 4 |
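
The two configurations in the table above yield the same number of examples per optimizer step on each GPU, which a short calculation makes explicit. The number of GPUs is not stated in this section, so `num_gpus` is a hypothetical parameter left at 1.

```python
def effective_batch(per_gpu_batch, grad_accum, num_gpus=1):
    """Examples contributing to one optimizer step.

    num_gpus is an assumption; the GPU count is not given in this section.
    """
    return per_gpu_batch * grad_accum * num_gpus


configs = {"T5-Small": (16, 1), "T5-Base": (4, 4)}  # (per-GPU batch, grad accum)
per_gpu_effective = {
    name: effective_batch(batch, accum) for name, (batch, accum) in configs.items()
}
print(per_gpu_effective)  # {'T5-Small': 16, 'T5-Base': 16}
```

Gradient accumulation therefore lets the larger model keep the same effective batch size while using a smaller per-GPU batch.
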

| Dataset | Model | Method | Macro F1 | Consistency | Sufficiency | Gold Sufficiency |
|---|---|---|---|---|---|---|
| Clinical Triage | T5-Small | R2D | 55.88 ± 0.01 [55.86, 55.90] | 39.64 ± 2.95 | 36.00 ± 2.00 | 41.47 ± 2.51 |
| Clinical Triage | T5-Small | R2D-C | 56.31 ± 0.30 [55.56, 57.06] | 81.30 ± 0.35 | 55.10 ± 0.59 | 77.87 ± 0.19 |
| Clinical Triage | T5-Small | DSS | 52.73 ± 0.99 [50.27, 55.19] | 42.36 ± 0.35 | 39.48 ± 0.26 | 47.02 ± 0.23 |
| Clinical Triage | T5-Base | R2D | 59.92 ± 0.42 [58.88, 60.96] | 45.28 ± 5.62 | 37.59 ± 5.08 | 46.61 ± 5.89 |
| Clinical Triage | T5-Base | R2D-C | 60.25 ± 0.30 [59.50, 61.00] | 86.26 ± 0.45 | 59.38 ± 0.32 | 82.86 ± 0.16 |
| Clinical Triage | T5-Base | DSS | 53.26 ± 0.89 [51.05, 55.47] | 45.49 ± 3.10 | 40.38 ± 2.98 | 50.13 ± 4.58 |
| Clinical Triage | Qwen-3-8B | Zero-Shot | 23.88 ± 0.00 | – | – | – |
| Clinical Triage | Qwen-3-32B | Zero-Shot | 33.08 ± 0.00 | – | – | – |
| Clinical Triage | BioMistral-7B | Zero-Shot | 6.45 ± 0.00 | – | – | – |
| Clinical Triage | OpenBioLLM-7B | Zero-Shot | 10.33 ± 0.00 | – | – | – |
| PubMedQA | T5-Small | R2D | 85.65 ± 0.21 [85.13, 86.17] | 68.86 ± 6.02 | 68.15 ± 4.53 | 68.55 ± 2.66 |
| PubMedQA | T5-Small | R2D-C | 85.90 ± 0.12 [85.60, 86.20] | 90.52 ± 3.15 | 81.94 ± 1.70 | 78.55 ± 1.97 |
| PubMedQA | T5-Small | DSS | 84.39 ± 0.48 [83.20, 85.58] | 81.45 ± 3.00 | 77.15 ± 1.88 | 80.61 ± 5.01 |
| PubMedQA | T5-Base | R2D | 89.67 ± 0.31 [88.90, 90.44] | 64.78 ± 2.01 | 64.03 ± 1.28 | 67.14 ± 3.20 |
| PubMedQA | T5-Base | R2D-C | 89.80 ± 0.21 [89.28, 90.32] | 96.42 ± 0.27 | 87.71 ± 0.19 | 83.04 ± 3.48 |
| PubMedQA | T5-Base | DSS | 89.30 ± 0.43 [88.23, 90.37] | 61.63 ± 0.42 | 62.39 ± 0.19 | 62.27 ± 2.24 |
| PubMedQA | Qwen-3-8B | Zero-Shot | 84.24 ± 0.00 | – | – | – |
| PubMedQA | Qwen-3-32B | Zero-Shot | 90.67 ± 0.00 | – | – | – |
| PubMedQA | BioMistral-7B | Zero-Shot | 29.44 ± 0.00 | – | – | – |
| PubMedQA | OpenBioLLM-7B | Zero-Shot | 46.22 ± 0.00 | – | – | – |
| DDXPlus | T5-Small | R2D | 99.56 ± 0.05 [99.44, 99.68] | 4.14 ± 3.17 | 4.12 ± 3.17 | 4.56 ± 3.40 |
| DDXPlus | T5-Small | R2D-C | 99.52 ± 0.00 [99.52, 99.52] | 92.95 ± 5.56 | 92.60 ± 5.55 | 91.96 ± 5.74 |
| DDXPlus | T5-Small | DSS | 99.29 ± 0.14 [98.94, 99.64] | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.01 ± 0.01 |
| DDXPlus | T5-Base | R2D | 99.53 ± 0.26 [98.88, 100.00] | 2.37 ± 1.64 | 2.37 ± 1.64 | 2.58 ± 1.73 |
| DDXPlus | T5-Base | R2D-C | 99.39 ± 0.23 [98.82, 99.96] | 99.53 ± 0.17 | 99.32 ± 0.16 | 99.17 ± 0.09 |
| DDXPlus | T5-Base | DSS | 99.40 ± 0.31 [98.63, 100.00] | 5.04 ± 5.27 | 5.08 ± 5.31 | 5.13 ± 5.17 |
| DDXPlus | Qwen-3-8B | Zero-Shot | 36.64 ± 0.00 | – | – | – |
| DDXPlus | Qwen-3-32B | Zero-Shot | 49.44 ± 0.00 | – | – | – |
| DDXPlus | BioMistral-7B | Zero-Shot | 5.64 ± 0.00 | – | – | – |
| DDXPlus | OpenBioLLM-7B | Zero-Shot | 24.34 ± 0.00 | – | – | – |
| MedReason | T5-Small | R2D | 16.93 ± 0.79 [14.97, 18.89] | 0.02 ± 0.04 | 0.02 ± 0.04 | 0.00 ± 0.00 |
| MedReason | T5-Small | R2D-C | 16.96 ± 0.81 [14.95, 18.97] | 2.02 ± 0.34 | 0.97 ± 0.21 | 0.09 ± 0.03 |
| MedReason | T5-Small | DSS | 15.92 ± 0.19 [15.45, 16.39] | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| MedReason | T5-Base | R2D | 18.86 ± 0.16 [18.46, 19.26] | 0.05 ± 0.04 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| MedReason | T5-Base | R2D-C | 18.88 ± 0.19 [18.41, 19.35] | 24.58 ± 13.19 | 8.58 ± 3.96 | 0.31 ± 0.10 |
| MedReason | T5-Base | DSS | 15.15 ± 0.34 [14.31, 15.99] | 0.04 ± 0.08 | 0.07 ± 0.07 | 0.05 ± 0.04 |
| MedReason | Qwen-3-8B | Zero-Shot | 46.87 ± 0.00 | – | – | – |
| MedReason | Qwen-3-32B | Zero-Shot | 77.29 ± 0.00 | – | – | – |
| MedReason | BioMistral-7B | Zero-Shot | 37.59 ± 0.00 | – | – | – |
| MedReason | OpenBioLLM-7B | Zero-Shot | 53.01 ± 0.00 | – | – | – |
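
The bracketed Macro F1 intervals in the table above are numerically consistent with a two-sided 95% Student-t interval computed over three runs. This is an inference from the reported numbers, not something this section states, so both the interval type and `n=3` should be read as assumptions. The sketch below reproduces the interval for the Clinical Triage T5-Small R2D-C row (56.31 ± 0.30).

```python
import math

# Two-sided 95% Student-t critical value for 2 degrees of freedom (n = 3 runs).
T_CRIT_DF2 = 4.302653


def t_interval(mean, std, n=3, t_crit=T_CRIT_DF2):
    """Mean +/- t * std / sqrt(n); n and the 95% level are assumptions."""
    half_width = t_crit * std / math.sqrt(n)
    return mean - half_width, mean + half_width


lo, hi = t_interval(56.31, 0.30)
print(f"[{lo:.2f}, {hi:.2f}]")  # compare with the reported [55.56, 57.06]
```

Because the reported means and standard deviations are rounded to two decimals, recomputed bounds can differ from the table in the last digit.
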

| Model | Method | 8B Judge | 32B Judge |
|---|---|---|---|
| T5-Small | R2D | 4.74 ± 0.02 | 4.18 ± 0.01 |
| T5-Small | R2D-C | 4.86 ± 0.02 | 4.44 ± 0.01 |
| T5-Small | DSS | 4.64 ± 0.03 | 3.99 ± 0.02 |
| T5-Base | R2D | 4.79 ± 0.02 | 4.34 ± 0.01 |
| T5-Base | R2D-C | 4.87 ± 0.03 | 4.48 ± 0.01 |
| T5-Base | DSS | 4.66 ± 0.02 | 4.07 ± 0.02 |

| Dataset | Model | Method | Macro F1 | Consistency | Sufficiency | Gold Sufficiency | 8B Judge | 32B Judge |
|---|---|---|---|---|---|---|---|---|
| Clinical Triage | T5-Small | w/o cycle | 56.41 | 41.71 | 36.41 | 41.56 | 4.90 | 4.44 |
| Clinical Triage | T5-Small | w/o CAS | 56.49 | 69.62 | 52.21 | 74.35 | 4.78 | 4.19 |
| Clinical Triage | T5-Small | R2D | 55.88 | 39.64 | 36.00 | 41.47 | 4.74 | 4.18 |
| Clinical Triage | T5-Small | R2D-C | 56.31 | 81.30 | 55.10 | 77.87 | 4.86 | 4.44 |
| Clinical Triage | T5-Base | w/o cycle | 59.69 | 51.42 | 41.17 | 50.46 | 4.88 | 4.47 |
| Clinical Triage | T5-Base | w/o CAS | 59.75 | 79.01 | 57.44 | 81.53 | 4.80 | 4.35 |
| Clinical Triage | T5-Base | R2D | 59.92 | 45.28 | 37.59 | 46.61 | 4.79 | 4.34 |
| Clinical Triage | T5-Base | R2D-C | 60.25 | 86.26 | 59.38 | 82.86 | 4.87 | 4.48 |
| PubMedQA | T5-Small | w/o cycle | 85.64 | 71.62 | 69.85 | 69.86 | – | – |
| PubMedQA | T5-Small | w/o CAS | 85.87 | 91.42 | 82.24 | 78.92 | – | – |
| PubMedQA | T5-Small | R2D | 85.65 | 68.86 | 68.15 | 68.55 | – | – |
| PubMedQA | T5-Small | R2D-C | 85.90 | 90.52 | 81.94 | 78.55 | – | – |
| PubMedQA | T5-Base | w/o cycle | 89.91 | 66.07 | 64.53 | 67.60 | – | – |
| PubMedQA | T5-Base | w/o CAS | 89.64 | 96.98 | 87.89 | 84.43 | – | – |
| PubMedQA | T5-Base | R2D | 89.67 | 64.78 | 64.03 | 67.14 | – | – |
| PubMedQA | T5-Base | R2D-C | 89.80 | 96.42 | 87.71 | 83.04 | – | – |
| DDXPlus | T5-Small | w/o cycle | 99.53 | 6.70 | 6.67 | 6.98 | – | – |
| DDXPlus | T5-Small | w/o CAS | 99.55 | 97.10 | 96.80 | 96.42 | – | – |
| DDXPlus | T5-Small | R2D | 99.56 | 4.14 | 4.12 | 4.56 | – | – |
| DDXPlus | T5-Small | R2D-C | 99.52 | 92.95 | 92.60 | 91.96 | – | – |
| DDXPlus | T5-Base | w/o cycle | 99.56 | 4.09 | 4.09 | 4.11 | – | – |
| DDXPlus | T5-Base | w/o CAS | 99.66 | 99.64 | 99.41 | 99.26 | – | – |
| DDXPlus | T5-Base | R2D | 99.53 | 2.37 | 2.37 | 2.58 | – | – |
| DDXPlus | T5-Base | R2D-C | 99.39 | 99.53 | 99.32 | 99.17 | – | – |

| Model | Method | Macro F1 | Consistency | Sufficiency | Gold Sufficiency | 8B Judge | 32B Judge |
|---|---|---|---|---|---|---|---|
| T5-Small | regular_softmax | 56.45 | 80.67 | 54.99 | 76.95 | 4.84 | 4.44 |
| T5-Small | temperature_softmax | 56.34 | 80.83 | 54.61 | 77.27 | 4.84 | 4.44 |
| T5-Small | R2D-C | 56.31 | 81.30 | 55.10 | 77.87 | 4.86 | 4.44 |
| T5-Base | regular_softmax | 59.78 | 76.26 | 53.84 | 73.60 | 4.87 | 4.48 |
| T5-Base | temperature_softmax | 59.53 | 78.54 | 54.67 | 74.20 | 4.88 | 4.50 |
| T5-Base | R2D-C | 60.25 | 86.26 | 59.38 | 82.86 | 4.87 | 4.48 |
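
The table above compares softmax-based ways of building a soft rationale against the default R2D-C variant. Assuming the default relies on the Gumbel-softmax relaxation [12] (this section does not say so explicitly), one forward pass can be sketched in NumPy as below. The shapes, the temperature `tau`, and the step of mixing token embeddings by the relaxed weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed one-hot samples over the last axis (Gumbel-softmax)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise; the lower bound keeps log(u) finite.
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    z = (logits + gumbel) / tau
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def soft_rationale_embeddings(logits, embedding_matrix, tau=1.0, rng=None):
    """Mix token embeddings by relaxed token weights.

    logits:           (seq_len, vocab) decoder logits for the rationale
    embedding_matrix: (vocab, dim) token embedding table
    """
    weights = gumbel_softmax(logits, tau=tau, rng=rng)
    return weights @ np.asarray(embedding_matrix, dtype=float)


rng = np.random.default_rng(0)
demo = soft_rationale_embeddings(rng.normal(size=(4, 6)), rng.normal(size=(6, 3)), tau=0.5, rng=rng)
print(demo.shape)  # (4, 3)
```

In an autograd framework the same operations are differentiable with respect to the logits, which is what lets a rationale-to-label loss propagate back into the rationale generator; lower `tau` makes the weights closer to one-hot.
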

| Dataset | Cycle-Consistency Weight | Macro F1 | Consistency | Sufficiency | Gold Sufficiency |
|---|---|---|---|---|---|
| Clinical Triage | 0.0 | 56.41 | 41.71 | 36.41 | 41.56 |
| Clinical Triage | 0.1 | 56.31 | 81.30 | 55.10 | 77.87 |
| Clinical Triage | 0.2 | 55.78 | 82.04 | 55.40 | 78.89 |
| Clinical Triage | 0.5 | 55.57 | 83.48 | 55.28 | 79.37 |
| PubMedQA | 0.0 | 85.64 | 71.62 | 69.85 | 69.86 |
| PubMedQA | 0.1 | 85.90 | 90.52 | 81.94 | 78.55 |
| PubMedQA | 0.2 | 85.79 | 91.63 | 82.00 | 78.53 |
| PubMedQA | 0.5 | 85.81 | 94.13 | 83.30 | 80.02 |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Hasan, H.M.Q.; Babiker, H.K.B.; Kim, M.-Y.; Goebel, R. Reason2Decide-C: Adaptive Cycle-Consistent Training for Clinical Rationales. Computers 2026, 15, 279. https://doi.org/10.3390/computers15050279

