[Figure: result.png]

Introduction

Recently, slow-thinking large language models have shown significant improvement on complex reasoning tasks such as mathematical reasoning. However, the model released by DeepSeek cannot manipulate external tools during the reasoning process. To further explore the potential of tool manipulation for slow-thinking models, we release this report and present the full details of our experiments on enhancing the reasoning ability of slow-thinking models via tool manipulation. As a preliminary exploration, we focus on mathematical reasoning tasks and propose STILL-3-Tool-32B, a model that can leverage Python code to aid its reasoning process. In our evaluation, the model achieves 81.7% accuracy on AIME 2024, matching the performance of o3-mini and outperforming o1 and DeepSeek-R1.

Data Construction

To construct a high-quality question set, we apply the following principles to select questions from Numina and OpenThoughts:

Once we obtain the high-quality question set, we leverage existing LLMs with slow-thinking abilities to construct the training data in two ways: distilling from DeepSeek-R1 and synthesizing from DeepSeek-R1-Distill-Qwen-32B.

Distill from DeepSeek-R1

Our major goal is to construct long-CoT solutions that manipulate external tools (i.e., Python code). Thus, an obvious approach is to leverage DeepSeek-R1 (R1) to generate solutions that satisfy our requirements. The tool-integration format follows the methodologies outlined in PoT, ToRA, DeepSeekMath, Numina-TIR, and Qwen 2.5 Math.

To distill training data from R1, we use the following prompt to instruct R1 to solve the problems with the help of code. Once R1 generates a response, we check whether it manipulates code and arrives at the correct answer. If the response satisfies both requirements, we add it to the training dataset (a sketch of this filtering step is given after the prompt).

PROMPT = """Analyze the following question through sequential logical steps. When encountering steps that require calculations, data manipulation, or empirical verification, you MUST generate executable Python code to validate your reasoning. Follow these guidelines precisely:

1. Thought Process Structure:
   - For each non-trivial computational/logical step:
     a) First explain the reasoning in natural language
     b) Then create a Python code block between ```python and ``` markers
     c) Immediately after, show the code's output between ```output and ``` markers
     d) Explicitly connect the output to the next reasoning step

2. Code Requirements:
   - Implement minimal verifiable examples (MVEs) to test specific hypotheses
   - Include print statements to display critical variables
   - Handle edge cases when applicable

3. Output Interpretation:
   - Quantitatively compare code results with your initial hypothesis
   - Resolve discrepancies before proceeding
   - Use exact numerical outputs to drive conclusions

Maintain clear separation between natural language explanations, code blocks, and output blocks. Present your final answer within \\boxed{}.\n\nQuestion: """
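The filtering step described above can be sketched as follows. This is a minimal illustration rather than our exact pipeline: the helper `query_r1` stands in for whatever API client is used to sample from R1, and the answer check is simplified to exact string matching on the `\boxed{}` content.

```python
import re

def extract_boxed_answer(response: str):
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def uses_code(response: str) -> bool:
    """Check that the response contains at least one ```python block and one ```output block."""
    return "```python" in response and "```output" in response

def build_training_set(questions, reference_answers, query_r1, num_samples=4):
    """Keep only responses that manipulate code AND reach the correct answer."""
    training_data = []
    for question, reference in zip(questions, reference_answers):
        for _ in range(num_samples):
            response = query_r1(PROMPT + question)  # PROMPT is the instruction shown above
            if uses_code(response) and extract_boxed_answer(response) == reference:
                training_data.append({"question": question, "response": response})
                break  # one verified solution per question is enough
    return training_data
```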

Synthesize from DeepSeek-R1-Distill-Qwen-32B

Since distilling from R1 is quite expensive, we also synthesize training data from DeepSeek-R1-Distill-Qwen-32B (Distill-Qwen), which is cheaper and more efficient.

In our early experiments, we tried to adopt the above prompt to directly instruct Distill-Qwen to generate solutions with code manipulation. However, we observed that it is hard to make Distill-Qwen follow our instructions. We suspect that the supervised fine-tuning process hurt the instruction-following ability of Distill-Qwen to a certain degree.

Therefore, we use a rule-based approach to modify the generated responses so that they satisfy our requirements:
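As a purely illustrative sketch of what such rule-based post-processing can look like (not the exact rules we use), one generic form is to execute the ```python blocks contained in a response and splice the captured stdout back in as ```output blocks, so that the result matches the required tool-integration format:

```python
import contextlib
import io
import re

# Matches a fenced ```python ... ``` block and captures its body.
CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_snippet(code: str) -> str:
    """Execute a code snippet and capture whatever it prints (illustration only; no sandboxing)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as err:  # keep the error text so the trace shows realistic feedback
        buffer.write(f"{type(err).__name__}: {err}")
    return buffer.getvalue().strip()

def attach_output_blocks(response: str) -> str:
    """Insert an ```output ... ``` block right after each ```python ... ``` block."""
    def _replace(match: re.Match) -> str:
        code = match.group(1)
        output = run_snippet(code)
        return f"```python\n{code}```\n```output\n{output}\n```"
    return CODE_BLOCK.sub(_replace, response)
```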