Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models

1UKP Lab, Technical University of Darmstadt, Hessian.AI 2Dept. of ECE & Ingenuity Labs Research Institute, Queen's University 3University of Bath
TLDR: Divergent Chain of Thought (DCoT) requires models to generate multiple CoTs before choosing an answer. Adding DCoT data to instruction tuning allows models to improve performance through self-correction.

Abstract

We present a novel method that further improves reasoning performance by requiring models to compare multiple reasoning chains before generating a solution, all within a single inference step. We call this method Divergent CoT (DCoT).

We generate a DCoT dataset where each question is answered by a series of alternative (and correct) chains of thought. Importantly, all these CoTs are part of the same label, thus forcing the LLM to learn how to generate multiple CoTs in a single inference step.

We find that instruction tuning on DCoT datasets boosts the performance of LLMs of all sizes (from 1.3B to 70B parameters). These performance gains stem from models generating multiple divergent reasoning chains in a single inference step, indicating that DCoT training enables self-correction in language models.

Method

Divergent Chain of Thought (DCoT)

We instruction-tune LLMs to generate a sequence of divergent CoTs before selecting the final answer, all within a single inference step. To this end, we devise a DCoT instruction template with a set of commands (in brackets) that specify the number of CoTs to generate:

Prompt:[Question] Question [Options] Options [Number of answers] k
Response:[Answer 1] CoT_1 [Answer 2] ... [Answer k] CoT_k [Final answer] answer
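
Below is a minimal sketch (not the authors' released code) of how a single DCoT training example could be assembled from this template; the function name and dictionary keys are illustrative.

from typing import Dict, List

def build_dcot_example(question: str, options: str, cots: List[str], final_answer: str) -> Dict[str, str]:
    """Pack k correct CoTs for one question into a single prompt/response pair."""
    k = len(cots)
    prompt = f"[Question] {question} [Options] {options} [Number of answers] {k}"
    # All k CoTs belong to the same target, so the model must learn to
    # produce several divergent chains in one inference step.
    response = " ".join(f"[Answer {i + 1}] {cot}" for i, cot in enumerate(cots))
    response += f" [Final answer] {final_answer}"
    return {"prompt": prompt, "response": response}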

Chain of Thought (CoT) Baseline

Similarly, to establish a comparable baseline, we instruction-tune the same LLMs using the more traditional CoT format. To ensure a fair comparison, we use the same reasoning chains for training. In this format, each data point is composed of a question and a single CoT, and a question may appear in more than one data point, each time with a different CoT. In this way, the model leverages CoT diversity at training time but, unlike DCoT, it does not do so at inference time.
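
For illustration, a corresponding sketch (again hypothetical, not the authors' code) of how the same pool of CoTs is flattened into single-CoT baseline examples:

from typing import Dict, List

def build_cot_examples(question: str, options: str, cots: List[str], final_answer: str) -> List[Dict[str, str]]:
    """Create one training data point per alternative chain of thought."""
    prompt = f"[Question] {question} [Options] {options}"
    return [
        {"prompt": prompt, "response": f"{cot} [Final answer] {final_answer}"}
        for cot in cots
    ]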

CoT Data Generation

We use GPT-3.5 Turbo in the zero-shot setting with multiple trigger phrases, such as "Let's think step by step", to generate CoTs. For each question, we select four random CoT triggers. We limit the number of CoTs to four to ensure that the targets fit within the context window of the LLMs.
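
A minimal sketch of this generation step, assuming the OpenAI Python client (openai >= 1.0); the trigger phrases other than "Let's think step by step" are placeholders rather than the exact ones used in the paper.

import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TRIGGERS = [
    "Let's think step by step.",
    "Let's work this out step by step to be sure we have the right answer.",  # placeholder
    "Let's reason about this problem carefully.",                             # placeholder
    "Let's break the problem down into parts.",                               # placeholder
]

def generate_cots(question: str, n_cots: int = 4) -> list:
    """Sample n_cots chains of thought, each prompted with a different trigger."""
    cots = []
    for trigger in random.sample(TRIGGERS, n_cots):
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"{question}\n{trigger}"}],
        )
        cots.append(completion.choices[0].message.content)
    return cots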

Results

Method   Phi 1.5 (1.3B)   Phi 2 (2.7B)   LLaMA2 7B   LLaMA2 13B   LLaMA2 70B
DCoT     49.39            62.60          60.80       66.18        68.63
CoT      47.20            60.85          58.97       64.39        66.96

The table shows the average results of DCoT and CoT across 8 QA reasoning tasks. We observe that DCoT achieves consistent and significant performance gains compared to CoT across all LLM families and sizes.

The datasets are:

Dataset        Reasoning Type
ARC            High-School Science
BGQA           Logic
CoinFlip       State-tracking
ConditionalQA  Conditional
GSM8K          Math
HotpotQA       Explicit multi-hop
LLC            Symbolic
Quartz         Qualitative relationships
StrategyQA     Implicit multi-hop

Self-Correction (DCoT@2 - DCoT@1)

The table below shows the performance gain from generating two CoTs instead of one (i.e., DCoT@2 - DCoT@1). We observe performance gains from simply generating a second CoT in over 62% of cases (25 out of 40 LLM x dataset combinations), with gains larger than 0.5 points on more than half of the datasets for Phi 1.5, Phi 2, LLaMA2 13B, and LLaMA2 70B. A sketch of how these deltas can be computed follows the table.

LLM          ARC    BGQA   CQA    GSM8K  HQA    LLC   Quartz  StrQA
Phi 1.5      1.26   2.1    0.1    3      0.83   -14   3.38    1.11
Phi 2        -3.56  -2.38  0.95   0.8    1.06   14    1.55    -0.85
LLaMA2 7B    1.28   -0.99  -0.56  4      -0.01  6     -1.04   0.25
LLaMA2 13B   4.15   0.91   -1.02  3      2.02   12    0.77    -2.03
LLaMA2 70B   3.24   1.38   3.68   10     0      4     -1      -4.07
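
For reference, a hedged sketch (not the authors' evaluation code) of how final answers can be parsed from DCoT@k generations and how the per-dataset deltas above could be computed:

from typing import List

def extract_final_answer(generation: str) -> str:
    """Take the text after the last [Final answer] marker."""
    parts = generation.rsplit("[Final answer]", 1)
    return parts[-1].strip() if len(parts) == 2 else ""

def accuracy(generations: List[str], gold: List[str]) -> float:
    """Exact-match accuracy (in %) of the parsed final answers."""
    correct = sum(extract_final_answer(g) == a for g, a in zip(generations, gold))
    return 100.0 * correct / len(gold)

# Gain from generating a second CoT, as reported in the table:
# delta = accuracy(dcot_at_2_outputs, gold) - accuracy(dcot_at_1_outputs, gold)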

These results indicate that DCoT tuning enables models to self-correct.
It is important to note that our training data includes only reasoning chains that lead to the correct answer, never incorrect ones. This suggests that the ability to self-correct can be enabled in LLMs without explicitly training for it.
We argue that this self-correction ability stems from the model's attempt to generate subsequent correct CoTs. In other words, the model may generate a first wrong CoT without knowing it, but it then generates a second CoT that is correct and, therefore, as a side effect, corrects the first one.

BibTeX

@misc{puerto2024dcot,
  title={Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models},
  author={Haritz Puerto and Tilek Chubakov and Xiaodan Zhu and Harish Tayyar Madabushi and Iryna Gurevych},
  year={2024},
  eprint={2407.03181},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.03181},
}