Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models

1UKP Lab, Technical University of Darmstadt, Hessian.AI 2Dept. of ECE & Ingenuity Labs Research Institute, Queen's University 3University of Bath
TLDR: Divergent Chain of Thought (DCoT) requires models to generate multiple CoTs before choosing an answer. Adding DCoT data to instruction tuning allows models to improve performance through self-correction.

Abstract

We present a novel method that further improves reasoning performance by requiring models to compare multiple reasoning chains before generating a solution, all within a single inference step. We call this method Divergent CoT (DCoT).

We generate a DCoT dataset where each question is answered by a series of alternative (and correct) chains of thought. Importantly, all these CoTs are part of the same label, thus forcing the LLM to learn how to generate multiple CoTs in a single inference step.

We find that instruction tuning on DCoT datasets boosts the performance of LLMs of all sizes (from 1.3B to 70B parameters). These performance gains stem from models generating multiple divergent reasoning chains in a single inference step, indicating that DCoT training enables self-correction in language models.

Method

Divergent Chain of Thought (DCoT)

We instruction-tune LLMs to generate a sequence of divergent CoTs before selecting the final answer, all within a single inference step. To this end, we devise a DCoT instruction template with a set of commands (in brackets) that specify the number of CoTs to generate:

Prompt:[Question] Question [Options] Options [Number of answers] k
Response:[Answer 1] CoT_1 [Answer 2] ... [Answer k] CoT_k [Final answer] answer
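
Below is a minimal sketch (not the authors' released code) of how a single DCoT training example could be assembled from this template; the function name and dictionary keys are illustrative.

from typing import Dict, List

def build_dcot_example(question: str, options: str, cots: List[str], final_answer: str) -> Dict[str, str]:
    """Pack k correct CoTs for one question into a single prompt/response pair."""
    k = len(cots)
    prompt = f"[Question] {question} [Options] {options} [Number of answers] {k}"
    # All k CoTs belong to the same target, so the model must learn to
    # produce several divergent chains in one inference step.
    response = " ".join(f"[Answer {i + 1}] {cot}" for i, cot in enumerate(cots))
    response += f" [Final answer] {final_answer}"
    return {"prompt": prompt, "response": response}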

Chain of Thought (CoT) Baseline

Similarly, to establish a comparable baseline, we instruction-tune the same LLMs using the more traditional CoT format. To ensure a fair comparison, we use the same reasoning chains for training. In this format, each data point is composed of a question and a single CoT, and a question may appear in more than one data point, each time with a different CoT. In this way, the model leverages CoT diversity at training time but, unlike DCoT, it does not do so at inference time.
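
For illustration, a corresponding sketch (again hypothetical, not the authors' code) of how the same pool of CoTs is flattened into single-CoT baseline examples:

from typing import Dict, List

def build_cot_examples(question: str, options: str, cots: List[str], final_answer: str) -> List[Dict[str, str]]:
    """Create one training data point per alternative chain of thought."""
    prompt = f"[Question] {question} [Options] {options}"
    return [
        {"prompt": prompt, "response": f"{cot} [Final answer] {final_answer}"}
        for cot in cots
    ]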

CoT Data Generation

We use GPT-3.5 Turbo in the zero-shot setting with multiple trigger phrases, such as "Let's think step by step", to generate CoTs. For each question, we select four random CoT triggers. We limit the number of CoTs to four to ensure that the targets fit within the context window of the LLMs.
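
A minimal sketch of this generation step, assuming the OpenAI Python client (openai >= 1.0); the trigger phrases other than "Let's think step by step" are placeholders rather than the exact ones used in the paper.

import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TRIGGERS = [
    "Let's think step by step.",
    "Let's work this out step by step to be sure we have the right answer.",  # placeholder
    "Let's reason about this problem carefully.",                             # placeholder
    "Let's break the problem down into parts.",                               # placeholder
]

def generate_cots(question: str, n_cots: int = 4) -> list:
    """Sample n_cots chains of thought, each prompted with a different trigger."""
    cots = []
    for trigger in random.sample(TRIGGERS, n_cots):
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"{question}\n{trigger}"}],
        )
        cots.append(completion.choices[0].message.content)
    return cots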

Results

Method   Phi 1.5 (1.3B)   Phi 2 (2.7B)   LLaMA2 7B   LLaMA2 13B   LLaMA2 70B
DCoT     49.39            62.60          60.80       66.18        68.63
CoT      47.20            60.85          58.97       64.39        66.96

The table shows the average results of DCoT and CoT across 8 QA reasoning tasks. We observe that DCoT achieves consistent and significant performance gains compared to CoT across all LLM families and sizes.

The datasets are:

Dataset        Reasoning Type
ARC            High-School Science
BGQA           Logic
CoinFlip       State-tracking
ConditionalQA  Conditional
GSM8K          Math
HotpotQA       Explicit multi-hop
LLC            Symbolic
Quartz         Qualitative relationships
StrategyQA     Implicit multi-hop

Self-Correction (DCoT@2 - DCoT@1)

The table below shows the performance gain from generating two CoTs instead of one (i.e., DCoT@2 - DCoT@1). We observe performance gains from simply generating a second CoT in over 62% of cases (25 out of 40 LLM x dataset combinations), with gains larger than 0.5 points on more than half of the datasets for Phi 1.5, Phi 2, LLaMA2 13B, and LLaMA2 70B. A sketch of how these deltas can be computed follows the table.

LLM          ARC    BGQA   CQA    GSM8K  HQA    LLC   Quartz  StrQA
Phi 1.5      1.26   2.1    0.1    3      0.83   -14   3.38    1.11
Phi 2        -3.56  -2.38  0.95   0.8    1.06   14    1.55    -0.85
LLaMA2 7B    1.28   -0.99  -0.56  4      -0.01  6     -1.04   0.25
LLaMA2 13B   4.15   0.91   -1.02  3      2.02   12    0.77    -2.03
LLaMA2 70B   3.24   1.38   3.68   10     0      4     -1      -4.07
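
For reference, a hedged sketch (not the authors' evaluation code) of how final answers can be parsed from DCoT@k generations and how the per-dataset deltas above could be computed:

from typing import List

def extract_final_answer(generation: str) -> str:
    """Take the text after the last [Final answer] marker."""
    parts = generation.rsplit("[Final answer]", 1)
    return parts[-1].strip() if len(parts) == 2 else ""

def accuracy(generations: List[str], gold: List[str]) -> float:
    """Exact-match accuracy (in %) of the parsed final answers."""
    correct = sum(extract_final_answer(g) == a for g, a in zip(generations, gold))
    return 100.0 * correct / len(gold)

# Gain from generating a second CoT, as reported in the table:
# delta = accuracy(dcot_at_2_outputs, gold) - accuracy(dcot_at_1_outputs, gold)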

These results indicate that DCoT tuning enables models to self-correct.
It is important to note that our training data includes only reasoning chains that lead to the correct answer, never incorrect ones. This suggests that the ability to self-correct can be enabled in LLMs without explicitly training for it.
We argue that this self-correction ability stems from the model's attempt to generate subsequent correct CoTs. In other words, the model may generate a first wrong CoT without knowing it, but it then generates a second CoT that is correct and, therefore, as a side effect, corrects the first one.

BibTeX

@misc{puerto2024dcot,
  title={Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models},
  author={Haritz Puerto and Tilek Chubakov and Xiaodan Zhu and Harish Tayyar Madabushi and Iryna Gurevych},
  year={2024},
  eprint={2407.03181},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.03181},
}