Jianing Sun1, Zhichao Zhang2, Xiaopu Wang1, Xinyuan Ji1, Yizhi Zhang1,*
1School of Computer Science, Shaanxi Normal University, Xi’an, Shaanxi, 710062, China
2School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, 571158, China
*Corresponding author
Since the introduction of Chain-of-Thought (CoT) prompting, leveraging Large Language Models (LLMs) to solve complex reasoning problems has become possible. While an increasing number of studies focus on improving answer accuracy, there is still a lack of efficient mechanisms for evaluating and rectifying errors during the reasoning process. To tackle this challenge, we propose a new strategy, fallback prompting, which enables self-refinement of LLMs through a feedback-driven method. Our main goal is to precisely locate and revise errors through a backward evaluation process. We conducted experiments on seven datasets across three reasoning tasks: arithmetic reasoning, symbolic reasoning, and knowledge-based reasoning. The results demonstrate that fallback prompting achieves state-of-the-art performance across all datasets and models. Notably, it achieves near-perfect accuracy of 99.3% on Chinese-school-Math with Qwen2.5 and delivers outstanding results on symbolic and knowledge-based reasoning tasks, including 91.7% accuracy on HIST and 97.3% on CSQA with GLM4. These findings highlight the effectiveness and robustness of fallback prompting in enhancing LLMs' reasoning capabilities, offering a promising direction for improving reasoning accuracy through self-refinement.
Chain-of-Thought, Large Language Models, complex reasoning, prompt tuning, error propagation
Jianing Sun, Zhichao Zhang, Xiaopu Wang, Xinyuan Ji, Yizhi Zhang (2024). Fallback Prompting Guides Large Language Models for Accurate Responses in Complex Reasoning. Journal of Networking and Network Applications, Volume 4, Issue 3, pp. 109–117. https://doi.org/10.33969/J-NaNA.2024.040302.