We present a novel method that further improves performance by requiring models to compare multiple reasoning chains before generating a solution, all within a single inference step. We call this method Divergent CoT (DCoT).
We generate a DCoT dataset where a question is answered by a series of alternative (and correct) chains of thought. Importantly, all these CoTs are part of the same label, thus forcing the LLM to learn to generate multiple CoTs in a single inference step.
We find that instruction tuning on DCoT datasets boosts the performance of LLMs of all sizes (from 1.3B to 70B). These performance gains stem from models generating multiple divergent reasoning chains in a single inference step, indicating that DCoT training enables self-correction in language models.
We instruction-tune LLMs to generate, within a single inference step, a sequence of divergent CoTs before selecting the final answer. To this end, we devise a DCoT instruction template, in which a set of bracketed commands requests the number of CoTs to generate:
    Prompt:
    [Question] Question
    [Options] Options
    [Number of answers] k

    Response:
    [Answer 1] CoT_1
    [Answer 2] CoT_2
    ...
    [Answer k] CoT_k
    [Final answer] answer
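To make the template concrete, here is a minimal sketch of how a single DCoT training example could be serialized and how the final answer could be recovered from a generation. The helper names (`build_dcot_example`, `parse_final_answer`) are ours for illustration, not part of the paper's released code:

```python
def build_dcot_example(question: str, options: str,
                       cots: list[str], final_answer: str) -> dict:
    """Serialize one training example in the DCoT template shown above."""
    k = len(cots)
    prompt = f"[Question] {question} [Options] {options} [Number of answers] {k}"
    # All k CoTs are concatenated into a single target, so the model learns
    # to produce several divergent chains in one inference step.
    chains = " ".join(f"[Answer {i}] {cot}" for i, cot in enumerate(cots, start=1))
    return {"prompt": prompt, "response": f"{chains} [Final answer] {final_answer}"}

def parse_final_answer(generation: str) -> str:
    """Recover the final answer from a DCoT-formatted generation."""
    return generation.rsplit("[Final answer]", 1)[-1].strip()
```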
Similarly, to establish a comparable baseline, we instruction-tune the same LLMs using the more traditional CoT format. To ensure a fair comparison, we use the same reasoning chains for training: each data point is composed of a question and a single CoT, and a question may appear in more than one data point, each time with a different CoT. In this way, the model leverages CoT diversity at training time but, unlike in DCoT, it does not do so at inference time.
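Under the same assumptions as the sketch above, the baseline training set can be built by splitting the same pool of chains into single-CoT data points (again, the helper name is illustrative, not the authors' code):

```python
def build_cot_baseline(question: str, options: str,
                       cots: list[str], final_answer: str) -> list[dict]:
    """Split the same CoT pool into one (prompt, response) pair per chain."""
    prompt = f"[Question] {question} [Options] {options}"
    return [
        # The question repeats across data points; only the CoT differs.
        {"prompt": prompt, "response": f"{cot} [Final answer] {final_answer}"}
        for cot in cots
    ]
```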
We use GPT-3.5 Turbo in the zero-shot setting with multiple triggers, such as "Let's think step by step", to generate the CoTs. For each question, we select four random CoT triggers. We limit the number of CoTs to four to ensure that the targets fit within the context window of the LLMs.
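A minimal sketch of this generation loop, assuming the current `openai` Python SDK; apart from "Let's think step by step", the triggers listed below are hypothetical stand-ins, not the paper's exact prompts:

```python
import random
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY in the environment

client = OpenAI()

# Zero-shot CoT triggers. Only the first is confirmed by the text above;
# the rest are illustrative examples of alternative triggers.
TRIGGERS = [
    "Let's think step by step.",
    "Let's work through this problem carefully.",
    "Let's break the problem down into parts.",
    "Let's reason about each option in turn.",
    "Let's solve this one step at a time.",
    "Let's think about what the question is really asking.",
]

def sample_cots(question: str, n_cots: int = 4) -> list[str]:
    """Generate n_cots chains of thought, one per randomly chosen trigger."""
    cots = []
    for trigger in random.sample(TRIGGERS, n_cots):
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"{question}\n\n{trigger}"}],
        )
        cots.append(completion.choices[0].message.content)
    return cots
```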
Method | Phi 1.5 (1.3B) | Phi 2 (2.7B) | LLaMA2 7B | LLaMA2 13B | LLaMA2 70B |
---|---|---|---|---|---|
DCoT | 49.39 | 62.60 | 60.80 | 66.18 | 68.63 |
CoT | 47.20 | 60.85 | 58.97 | 64.39 | 66.96 |
The table shows the average results of DCoT and CoT across 8 QA reasoning tasks. We observe that DCoT achieves consistent and significant performance gains compared to CoT across all LLM families and sizes.
The datasets are:
Dataset | Reasoning Type |
---|---|
ARC | High-School Science |
BGQA | Logic |
CoinFlip | State-tracking |
ConditionalQA | Conditional |
GSM8K | Math |
HotpotQA | Explicit multi-hop |
LLC | Symbolic |
Quartz | Qualitative relationships |
StrategyQA | Implicit multi-hop |
The table below shows the performance gain from generating 2 CoTs (i.e., DCoT@2 - DCoT@1; see the sketch after the table). We observe performance gains from simply generating a second CoT in over 62% of cases (i.e., 25 out of 40 LLM x dataset combinations), with gains larger than 0.5 points on more than half of the datasets for Phi 1.5, Phi 2, LLaMA2 13B, and LLaMA2 70B.
LLM | ARC | BGQA | ConditionalQA | GSM8K | HotpotQA | LLC | Quartz | StrategyQA |
---|---|---|---|---|---|---|---|---|
Phi 1.5 | 1.26 | 2.10 | 0.10 | 3.00 | 0.83 | -14.00 | 3.38 | 1.11 |
Phi 2 | -3.56 | -2.38 | 0.95 | 0.80 | 1.06 | 14.00 | 1.55 | -0.85 |
LLaMA2 7B | 1.28 | -0.99 | -0.56 | 4.00 | -0.01 | 6.00 | -1.04 | 0.25 |
LLaMA2 13B | 4.15 | 0.91 | -1.02 | 3.00 | 2.02 | 12.00 | 0.77 | -2.03 |
LLaMA2 70B | 3.24 | 1.38 | 3.68 | 10.00 | 0.00 | 4.00 | -1.00 | -4.07 |
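Each cell above is the difference between two accuracies: one from a run that requests two CoTs and one from a run that requests a single CoT. A minimal sketch of that computation (our helper, not the paper's evaluation code):

```python
def dcot_gain(preds_at_2: list[str], preds_at_1: list[str], gold: list[str]) -> float:
    """DCoT@2 - DCoT@1 in accuracy points for one (LLM, dataset) pair."""
    def accuracy(preds: list[str]) -> float:
        # Percentage of predictions matching the gold answers.
        return 100.0 * sum(p == g for p, g in zip(preds, gold)) / len(gold)
    return accuracy(preds_at_2) - accuracy(preds_at_1)
```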
These results indicate that DCoT tuning enables models to self-correct.
It is important to note that our training data includes only reasoning chains that lead to the correct answer, never incorrect ones. This suggests that the ability to self-correct can be enabled in LLMs without explicitly training for it.
We argue that this self-correction ability stems from the model's attempt to generate subsequent correct CoTs. In other words, the model may generate a first, wrong CoT without knowing it, but it then generates a second CoT that is correct and, as a side effect, corrects the first one.
@misc{puerto2024dcot,
title={Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models},
author={Haritz Puerto and Tilek Chubakov and Xiaodan Zhu and Harish Tayyar Madabushi and Iryna Gurevych},
year={2024},
eprint={2407.03181},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.03181},
}