[Figure: result.png]

Introduction

Recently, slow-thinking large language models have shown significant improvement on complex reasoning tasks such as mathematical reasoning. However, the model released by DeepSeek cannot manipulate external tools during the reasoning process. To further explore the potential of tool manipulation for slow-thinking models, we release this report and present the full details of our experiments on enhancing the reasoning ability of slow-thinking models via tool manipulation. As a preliminary exploration, we focus on mathematical reasoning tasks and propose STILL-3-Tool-32B, a model that can leverage Python code to aid its reasoning process. In our evaluation, the model achieves 81.7% accuracy on AIME 2024, matching the performance of o3-mini and outperforming o1 and DeepSeek-R1.

Data Construction

To construct a high-quality question set, we apply the following principles to select questions from Numina and OpenThoughts:

Once we obtain the high-quality question set, we leverage existing LLMs with slow-thinking abilities to construct the training data in two ways: distilling from DeepSeek-R1 and synthesizing from DeepSeek-R1-Distill-Qwen-32B.

Distill from DeepSeek-R1

Our major goal is to construct long-CoT solutions that manipulate external tools (i.e., Python code). Thus, an obvious approach is to leverage DeepSeek-R1 (R1) to generate solutions that satisfy our requirements. The tool-integration format follows the methodologies outlined in PoT, ToRA, DeepSeekMath, Numina-TIR, and Qwen 2.5 Math.

To distill training data from R1, we use the following prompt to instruct R1 to solve the problems with the help of code. Once R1 generates a response, we check whether it manipulates code and arrives at the correct answer. If the response satisfies both requirements, we add it to the training dataset (a sketch of this filtering step is given after the prompt).

PROMPT = """Analyze the following question through sequential logical steps. When encountering steps that require calculations, data manipulation, or empirical verification, you MUST generate executable Python code to validate your reasoning. Follow these guidelines precisely:

1. Thought Process Structure:
   - For each non-trivial computational/logical step:
     a) First explain the reasoning in natural language
     b) Then create a Python code block between ```python and ``` markers
     c) Immediately after, show the code's output between ```output and ``` markers
     d) Explicitly connect the output to the next reasoning step

2. Code Requirements:
   - Implement minimal verifiable examples (MVEs) to test specific hypotheses
   - Include print statements to display critical variables
   - Handle edge cases when applicable

3. Output Interpretation:
   - Quantitatively compare code results with your initial hypothesis
   - Resolve discrepancies before proceeding
   - Use exact numerical outputs to drive conclusions

Maintain clear separation between natural language explanations, code blocks, and output blocks. Present your final answer within \\boxed{}.\n\nQuestion: """
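The filtering step described above can be sketched as follows. This is a minimal illustration rather than our exact pipeline: the helper `query_r1` stands in for whatever API client is used to sample from R1, and the answer check is simplified to exact string matching on the `\boxed{}` content.

```python
import re

def extract_boxed_answer(response: str):
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def uses_code(response: str) -> bool:
    """Check that the response contains at least one ```python block and one ```output block."""
    return "```python" in response and "```output" in response

def build_training_set(questions, reference_answers, query_r1, num_samples=4):
    """Keep only responses that manipulate code AND reach the correct answer."""
    training_data = []
    for question, reference in zip(questions, reference_answers):
        for _ in range(num_samples):
            response = query_r1(PROMPT + question)  # PROMPT is the instruction shown above
            if uses_code(response) and extract_boxed_answer(response) == reference:
                training_data.append({"question": question, "response": response})
                break  # one verified solution per question is enough
    return training_data
```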

Synthesize from DeepSeek-R1-Distill-Qwen-32B

Since distilling from R1 is quite expensive, we also synthesize training data from DeepSeek-R1-Distill-Qwen-32B (Distill-Qwen), which is cheaper and more efficient.

In our early experiments, we tried to adopt the above prompt to directly instruct Distill-Qwen to generate solutions with code manipulation. However, we observed that it is hard to make Distill-Qwen follow our instructions. We suspect that the supervised fine-tuning process hurt the instruction-following ability of Distill-Qwen to a certain degree.

Therefore, we use a rule-based approach to modify the generated responses so that they satisfy our requirements:
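As a purely illustrative sketch of what such rule-based post-processing can look like (not the exact rules we use), one generic form is to execute the ```python blocks contained in a response and splice the captured stdout back in as ```output blocks, so that the result matches the required tool-integration format:

```python
import contextlib
import io
import re

# Matches a fenced ```python ... ``` block and captures its body.
CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_snippet(code: str) -> str:
    """Execute a code snippet and capture whatever it prints (illustration only; no sandboxing)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as err:  # keep the error text so the trace shows realistic feedback
        buffer.write(f"{type(err).__name__}: {err}")
    return buffer.getvalue().strip()

def attach_output_blocks(response: str) -> str:
    """Insert an ```output ... ``` block right after each ```python ... ``` block."""
    def _replace(match: re.Match) -> str:
        code = match.group(1)
        output = run_snippet(code)
        return f"```python\n{code}```\n```output\n{output}\n```"
    return CODE_BLOCK.sub(_replace, response)
```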