In this paper, we introduce code prompting, a chain of prompts that transforms a natural language problem into code and directly prompts the LLM with the generated code, without resorting to external code execution. We hypothesize that code prompts can elicit certain reasoning capabilities of LLMs trained on text and code, and we use the proposed method to improve conditional reasoning, i.e., the ability to infer different conclusions depending on whether certain conditions are fulfilled.
We find that code prompting yields large performance gains for multiple LLMs (up to 22.52 percentage points on GPT 3.5, 7.75 on Mixtral, and 16.78 on Mistral) across multiple conditional reasoning datasets. Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for the performance improvement. Furthermore, code prompts improve the sample efficiency of in-context learning and facilitate state tracking of variables and entities.
We define code prompts as prompts that model a natural language (NL) problem with code. The code contains the logical structure needed to solve the problem, along with the original natural language text as code comments. To solve an NL task with code prompts, we define a chain of prompts that i) transforms the NL text into code, and ii) uses this code to generate the answer in natural language.
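A minimal sketch of this two-step chain is shown below. It assumes a generic llm(prompt) callable that wraps whichever text+code LLM is being queried; the function names and prompt wording are illustrative, not the paper's exact prompts.

```python
from typing import Callable


def code_prompting(document: str, question: str, llm: Callable[[str], str]) -> str:
    """Two-step code-prompting chain; `llm` wraps any text+code LLM call."""
    # Step 1: ask the LLM to rewrite the NL problem as code, keeping the
    # original sentences as comments, one variable per key entity, and one
    # if block per conditional statement.
    code = llm(
        "Rewrite the following document and question as Python-like code. "
        "Keep every original sentence as a comment, create a variable for "
        "each key entity, and an if block for each conditional statement.\n\n"
        f"Document:\n{document}\n\nQuestion:\n{question}"
    )
    # Step 2: prompt the LLM with the generated code (no external code
    # execution) and ask it to answer in natural language.
    return llm(
        "Given the following code representation of a problem, reason over "
        f"it step by step and answer the question in natural language.\n\n{code}"
    )
```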
The generated code closely follows the original NL text. In particular, it creates a variable for each key entity in the question and documents, and an if block for each conditional statement in the documents. The figure below exemplifies this transformation.
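As a rough illustration of the format, a generated code prompt for an invented eligibility question might look like the following sketch; the scenario, variable names, and comments are hypothetical and not taken from the datasets or from the paper's prompts.

```python
# Hypothetical example of a generated code prompt (invented scenario).
# Original sentences are kept as comments; key entities become variables
# and each conditional statement becomes an if block.

# "Applicants can receive the travel grant if they are enrolled students
#  and have not received the grant before."
is_enrolled_student = True       # "Alex is an enrolled student."
received_grant_before = False    # "Alex has never received the grant."
can_receive_grant = None

if is_enrolled_student and not received_grant_before:
    can_receive_grant = True
else:
    can_receive_grant = False

# Question: "Can Alex receive the travel grant?"
answer = can_receive_grant
```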
We experiment with text+code LLMs, i.e., LLMs trained to solve both natural language and coding tasks. Specifically, we use GPT 3.5 Turbo, Mixtral 8x7B, and Mistral 7B.
We evaluate both prompting formats across multiple conditional reasoning datasets: ConditionalQA, BoardgameQA, and ShARC.
| Model | Prompt | CondQA | ShARC | BGQA-1 | BGQA-2 | BGQA-3 | Avg. Delta (Code - Text) |
|---|---|---|---|---|---|---|---|
| GPT 3.5 | Text | 58.70 | 62.95 | 51.15 | 37.42 | 27.77 | |
| GPT 3.5 | Code | 60.60 | 54.98 | 58.67 | 55.56 | 50.29 | 8.42 |
| Mixtral | Text | 48.17 | 53.77 | 56.38 | 39.64 | 30.15 | |
| Mixtral | Code | 44.73 | 59.06 | 53.33 | 47.39 | 44.72 | 4.22 |
| Mistral | Text | 35.74 | 43.60 | 47.40 | 48.78 | 47.86 | |
| Mistral | Code | 33.28 | 49.92 | 53.80 | 51.27 | 48.79 | 2.74 |
Code prompts outperform text prompts on the test set in the majority of cases (11 out of 15). This trend holds across models, with each achieving its best performance through code prompts on most datasets (GPT 3.5 on 4/5, Mixtral on 3/5, Mistral on 4/5). Notably, code prompts consistently surpass text prompts on BGQA-2 and BGQA-3, the most reasoning-intensive datasets, for all models. This is particularly evident for GPT 3.5, where the gains exceed 18 points.
We hypothesize that one reason for the superior performance of code prompting is an improved ability to identify and track the states of key variables or concepts. This hypothesis is based on the intuition that, for natural language in general, the local context is the most important signal for generating the next token. Generating code, however, is often more demanding in this respect, because code frequently refers to previously defined functions and variables that can be dozens or even hundreds of lines away.
To test our hypothesis, we devise the following experiment. After each reasoning step in the answer response, we stop the generation of GPT 3.5 Turbo and query the model about all key entities defined in the input prompt. For text prompts, we ask whether the given facts are true; for code prompts, we ask for the values of the (boolean) variables. In all cases, the model only has to generate True, False, a string, or unknown. We then compare the percentage of errors under text and code prompts. This number represents the memory errors committed by the model: the more memory errors there are, the harder it is for the model to track and remember entities and variables.
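A minimal sketch of this probing procedure is given below, assuming the same kind of llm callable as above plus hypothetical inputs for the model's reasoning steps and the gold entity states; all names are illustrative rather than the paper's implementation.

```python
from typing import Callable, Dict, List


def memory_error_rate(
    prompt: str,
    reasoning_steps: List[str],      # the model's answer, split into steps
    gold_states: Dict[str, str],     # key entity/variable -> gold value
    llm: Callable[[str], str],
) -> float:
    """Fraction of entity/variable probes the model answers incorrectly."""
    errors, total = 0, 0
    context = prompt
    for step in reasoning_steps:
        # "Stop" generation after each reasoning step by re-feeding the
        # prompt plus the response produced so far.
        context += "\n" + step
        for entity, gold in gold_states.items():
            probe = (
                f"{context}\n\nWhat is the current value of '{entity}'? "
                "Answer with True, False, a string, or unknown."
            )
            prediction = llm(probe).strip().lower()
            errors += int(prediction != gold.strip().lower())
            total += 1
    return errors / max(total, 1)
```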
Memory-error rate (%) by prompt type, split by whether the model's final answer was correct or incorrect:

| Dataset | Text (Correct Ans.) | Code (Correct Ans.) | Text (Incorrect Ans.) | Code (Incorrect Ans.) |
|---|---|---|---|---|
| CondQA | 71.08 | 4.39 | 60.79 | 11.39 |
| BGQA-1 | 39.33 | 8.84 | 51.65 | 22.12 |
| BGQA-2 | 44.79 | 15.04 | 52.54 | 24.75 |
| BGQA-3 | 54.01 | 14.21 | 52.13 | 16.98 |
We observe that text prompts incur significantly more memory errors than code prompts on all datasets. The gap is consistently large, close to 30 percentage points or more, with peaks on CondQA (66.69) and BGQA-3 (39.80). This experiment therefore empirically supports our hypothesis that code prompts improve state tracking of key entities and variables compared to text prompts.
Given our observation that code prompts trigger conditional reasoning abilities better than text prompts, we ask whether code prompts are also more sample-efficient than text prompts. To answer this, we evaluate how the overall performance of GPT 3.5 changes with the number of demonstrations for the two prompting methods.
The figure shows that when we provide only one demonstration per class (i.e., answer type in our datasets), the performance gap is largest across all datasets. As expected, this gap shrinks as we provide more demonstrations. Moreover, code prompts with only one demonstration per class even outperform text prompts with three demonstrations per class, which further underlines the sample efficiency of code prompts. These results indicate that code prompts trigger conditional reasoning more efficiently than text prompts on GPT 3.5, which is one of the reasons for their superior performance.
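The sweep itself is simple to reproduce in outline. The sketch below assumes hypothetical helpers build_prompt (which assembles k demonstrations per answer class plus the test instance) and evaluate (which scores predictions against gold answers), alongside the same llm callable as above; none of these names come from the paper.

```python
from typing import Callable, Dict, List


def sample_efficiency_sweep(
    test_set: List[dict],
    demos_by_class: Dict[str, List[dict]],            # demos grouped by answer type
    build_prompt: Callable[[List[dict], dict], str],  # k-shot prompt builder
    evaluate: Callable[[List[str], List[dict]], float],
    llm: Callable[[str], str],
    max_k: int = 3,
) -> Dict[int, float]:
    """Score the model with k = 1..max_k demonstrations per answer class."""
    scores: Dict[int, float] = {}
    for k in range(1, max_k + 1):
        demos = [d for cls_demos in demos_by_class.values() for d in cls_demos[:k]]
        predictions = [llm(build_prompt(demos, example)) for example in test_set]
        scores[k] = evaluate(predictions, test_set)
    return scores
```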
@article{puerto2024code,
  title={Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs},
author={Puerto, Haritz and Tutek, Martin and Aditya, Somak and Zhu, Xiaodan and Gurevych, Iryna},
journal={arXiv preprint arXiv:2401.10065},
year={2024}
}