R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

1 Monash University    2 School of Software, Beihang University    3 South China University of Technology    4 ZIP Lab, Zhejiang University
Email: caesard216@gmail.com, bohan.zhuang@gmail.com

News:
(07/24/2025) 🎉 First version of the project is released on arXiv.

Abstract

Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of large language models by encouraging step-by-step intermediate reasoning during inference. While effective, CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. Existing acceleration strategies either reduce sequence length through early stopping or compressive reward designs, or improve decoding speed via speculative decoding with smaller models. However, speculative decoding suffers from limited speedup when the agreement between small and large models is low, and fails to exploit the potential advantages of small models in producing concise intermediate reasoning. In this paper, we present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference by switching between a small language model (SLM) and a large language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to generate tokens by default and delegates to the LLM only when the SLM's confidence falls below a threshold. This design avoids full-sequence rollback and selectively invokes the LLM on uncertain steps, preserving both efficiency and answer quality. R-Stitch is model-agnostic, training-free, and compatible with standard decoding pipelines. Experiments on math reasoning benchmarks demonstrate that R-Stitch achieves up to 85% reduction in inference latency with negligible accuracy drop, highlighting its practical effectiveness in accelerating CoT reasoning.

Token-level Analysis

Token-level consistency and speedup analysis. (a) shows the relationship between token-level consistency and decoding speedup across different LLM-SLM pairs on the AMC dataset. (b) presents the distribution of speedup ratios across individual samples from AMC. (c) illustrates the token counts for questions correctly answered by both the SLM and LLM.
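To make the notion of token-level consistency concrete, below is a minimal sketch of one way it could be measured; this is an illustration, not the paper's exact protocol. The helper `slm_next_token` is a hypothetical callable that returns the SLM's greedy next-token prediction given a prefix of token ids, and `llm_tokens` is the LLM's decoded sequence for the same question.

```python
def token_level_consistency(llm_tokens, slm_next_token):
    """Fraction of positions where the SLM's greedy prediction matches the
    token the LLM actually produced.

    llm_tokens: list of token ids decoded by the LLM (prompt + generation).
    slm_next_token: hypothetical callable mapping a prefix of token ids to
        the SLM's argmax next-token id.
    """
    total = len(llm_tokens) - 1
    if total <= 0:
        return 0.0
    agree = 0
    for i in range(total):
        prefix = llm_tokens[: i + 1]
        # Count a position as consistent if the SLM would have emitted the
        # same token the LLM chose at that step.
        if slm_next_token(prefix) == llm_tokens[i + 1]:
            agree += 1
    return agree / total
```

Higher consistency means the SLM can safely handle a larger share of the trajectory, which is what drives the per-sample speedup distribution shown in panel (b).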

Methodology

Method illustration
Overview of R-Stitch. Given a question with chain-of-thought (CoT) prompting, decoding alternates between a small language model (SLM) and a large language model (LLM) based on token-level confidence. Generation begins with the SLM. If the predicted token has low confidence, the system switches to the LLM, which overwrites the uncertain token and continues decoding. Conversely, when the LLM produces a high-confidence token, control is handed back to the SLM to reduce computational cost. This bidirectional switching mechanism enables dynamic resource allocation based on confidence, allowing R-Stitch to retain the efficiency of the SLM while leveraging the reliability of the LLM when necessary.
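The sketch below illustrates this bidirectional switching logic in plain Python. It is a simplified rendering under assumptions, not the released implementation: `slm_step` and `llm_step` are hypothetical callables that take the current token sequence and return the next token id together with its confidence (e.g., the max softmax probability), and the thresholds `tau_slm` and `tau_llm` are placeholder values.

```python
def hybrid_decode(prompt_ids, slm_step, llm_step,
                  tau_slm=0.7, tau_llm=0.9, eos_id=2, max_new_tokens=512):
    """Confidence-based switching between an SLM and an LLM along one trajectory.

    - Decode with the SLM by default.
    - If the SLM's token confidence falls below tau_slm, discard that token and
      let the LLM re-predict (overwrite) the uncertain position.
    - Once the LLM emits a token with confidence at or above tau_llm, hand
      control back to the SLM to save compute.
    """
    tokens = list(prompt_ids)
    use_llm = False  # generation starts with the SLM

    for _ in range(max_new_tokens):
        if not use_llm:
            token, conf = slm_step(tokens)
            if conf < tau_slm:
                # Low-confidence SLM token: switch to the LLM for this position.
                use_llm = True
                token, conf = llm_step(tokens)
        else:
            token, conf = llm_step(tokens)

        if use_llm and conf >= tau_llm:
            # High-confidence LLM token: return control to the SLM afterwards.
            use_llm = False

        tokens.append(token)
        if token == eos_id:
            break

    return tokens
```

Because only the single uncertain token is overwritten, the SLM's accepted prefix is kept as-is, which is how the method avoids the full-sequence rollback of speculative decoding.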

Performance


BibTeX

@article{chen2025r,
  title={R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning},
  author={Chen, Zhuokun and Chen, Zeren and He, Jiahao and Tan, Mingkui and Cai, Jianfei and Zhuang, Bohan},
  journal={arXiv preprint arXiv:2507.17307},
  year={2025}
}