R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

1 Monash University    2 School of Software, Beihang University    3 South China University of Technology    4 ZIP Lab, Zhejiang University
Email: caesard216@gmail.com, bohan.zhuang@gmail.com

News:

(07/24/2025) 🎉 First version of the project is released on arXiv.

(09/27/2025) 📑 We release an updated version with more comprehensive experimental analysis of R-Stitch and an extended framework R-Stitch⁺. Check it out on arXiv.

Abstract

Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy that lets the SLM efficiently handle low-entropy tokens while delegating uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch+, which learns an adaptive routing policy to adjust the token budget dynamically beyond fixed thresholds. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00× on DeepSeek-R1-Distill-Qwen-7B, 3.85× on 14B, and 4.10× on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency–accuracy trade-offs that can be tailored to diverse computational budgets without retraining.

Token-level Consistency and Speedup Analysis

Speculative decoding has received considerable attention due to its potential for substantial speedups. However, its effectiveness critically depends on the consistency between the small language model (SLM) and the large language model (LLM). We quantify this limitation using token-level consistency, defined as the percentage of tokens for which the SLM produces the same output as the LLM given an identical prefix.
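As a point of reference, this metric can be estimated by teacher-forcing both models on the same trajectory and comparing their greedy next-token predictions at each position. The sketch below assumes Hugging Face causal LMs that share a tokenizer; the function name and greedy matching are illustrative rather than the paper's exact protocol.

```python
import torch

@torch.no_grad()
def token_level_consistency(slm, llm, input_ids: torch.Tensor) -> float:
    """Fraction of positions where the SLM's greedy next-token prediction
    matches the LLM's under teacher forcing on the same prefix.

    input_ids: [1, seq_len] token ids of a reference trajectory; both models
    are assumed to be Hugging Face causal LMs sharing the same tokenizer.
    """
    slm_pred = slm(input_ids).logits.argmax(dim=-1)  # [1, seq_len]
    llm_pred = llm(input_ids).logits.argmax(dim=-1)
    # Position i predicts token i+1; drop the final position, which has no target.
    return (slm_pred[:, :-1] == llm_pred[:, :-1]).float().mean().item()
```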

Figure 1 illustrates three aspects: (a) consistency–speedup trade-offs across different model pairs, (b) the distribution of speedups across AMC samples, and (c) token-length comparison on correctly answered questions. The results show that speculative decoding can incur overhead when agreement is low and fails to exploit the more concise reasoning traces of SLMs.

Figure 1. (a) Token-level consistency vs. speedup across model pairs. (b) Speedup distribution across AMC samples. (c) Token usage by SLM vs. LLM on correctly answered questions.

Empirical Entropy Analysis

We analyze entropy patterns on AMC using DeepSeek-R1-Distill-Qwen-7B (LLM) and L1-1.5B-Short (SLM), revealing three key observations:

1. Incorrect answers show higher entropy. Reasoning traces leading to wrong answers have significantly higher mean entropy (Figure 2a).

2. Most tokens are near-deterministic. Over 89% of SLM tokens have entropy ≤ 0.1, indicating high prediction confidence (Figure 2b).

3. High-entropy tokens trigger errors. Harmful tokens are often preceded by locally elevated entropy, making entropy a useful routing signal (Figure 2c).


Figure 2. Entropy and error locality. (a) Incorrect solutions show higher entropy. (b) Most tokens have near-zero entropy. (c) Harmful tokens are preceded by high-entropy regions.
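For concreteness, the per-token entropy underlying these plots can be computed directly from the next-token distribution. The sketch below assumes PyTorch logits; the normalization by log vocabulary size and the mean-over-trace aggregation are illustrative choices, not necessarily the exact settings used in the paper.

```python
import math
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor, normalize: bool = True) -> torch.Tensor:
    """Per-position entropy of the next-token distribution.

    logits: [seq_len, vocab_size] pre-softmax scores.
    Returns a [seq_len] tensor; with normalize=True the entropy is divided
    by log(vocab_size) so values fall in [0, 1].
    """
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)
    if normalize:
        entropy = entropy / math.log(logits.shape[-1])
    return entropy

def mean_trace_entropy(logits: torch.Tensor) -> float:
    """Trace-level statistic used to compare correct vs. incorrect solutions."""
    return token_entropy(logits).mean().item()
```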

Methodology

Figure 3. Overview of R-Stitch: entropy-guided bidirectional decoding between SLM and LLM.

R-Stitch — Entropy-Guided Decoding

R-Stitch is a token-level hybrid decoding framework. Decoding begins with the small language model (SLM) to minimize latency. At each step, we compute the normalized token entropy H_t from the predictive distribution. If H_t is below a threshold, the SLM's token is accepted; otherwise the step is delegated to the large language model (LLM), which overwrites the token and continues decoding.

Crucially, switching is bidirectional: when the LLM enters a low-entropy region, control returns to the SLM to reduce cost again. Both models maintain their own KV caches; on a switch we reuse the previous cache and only partially prefill the new span, avoiding redundant attention on old tokens. This design jointly reduces per-token compute and overall sequence length while retaining LLM-level answer quality.

R-Stitch+ — RL-based Adaptive Routing

R-Stitch+ replaces fixed thresholds with a lightweight RL router that acts only when entropy is high. Given the current hidden state and context length, the router decides whether to continue with the SLM or to switch to the LLM, learning an adaptive policy that generalizes across prompts and budgets.

Training uses a latency-aware reward combining final-answer accuracy with an efficiency term derived from a simple cost estimator (prefill/decoding latency as a function of input length and KV size). This encourages policies that attain LLM-level accuracy at substantially lower end-to-end latency, and allows smooth efficiency–accuracy trade-offs without retraining the LMs.
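A minimal sketch of a latency-aware reward of this kind is shown below; the linear cost model, its coefficients, and the trade-off weight alpha are illustrative placeholders rather than the paper's actual estimator.

```python
def estimate_latency(prefill_tokens: int, decode_tokens: int, kv_tokens: int,
                     c_prefill: float = 1.0, c_decode: float = 1.0,
                     c_kv: float = 0.001) -> float:
    """Simple linear cost model: prefill cost grows with input length,
    per-token decoding cost grows with the KV-cache size."""
    return c_prefill * prefill_tokens + decode_tokens * (c_decode + c_kv * kv_tokens)

def latency_aware_reward(correct: bool, latency: float, baseline_latency: float,
                         alpha: float = 0.5) -> float:
    """Combine final-answer accuracy with an efficiency bonus measured against
    decoding everything with the LLM; alpha balances the two terms."""
    accuracy_term = 1.0 if correct else 0.0
    efficiency_term = max(0.0, 1.0 - latency / baseline_latency)
    return accuracy_term + alpha * efficiency_term
```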

R-Stitch Algorithm Flow
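The sketch below summarizes the entropy-guided loop described in the R-Stitch section above. It is a simplified illustration: the step callables stand in for single-token decoding with each model's own KV cache (including partial prefill of the tokens generated since that model last held control), and the threshold, greedy sampling, and stopping condition are placeholders rather than the actual implementation.

```python
import math
import torch
import torch.nn.functional as F

def normalized_entropy(logits: torch.Tensor) -> float:
    """Normalized entropy of a single next-token distribution ([vocab_size] logits)."""
    log_p = F.log_softmax(logits, dim=-1)
    h = -(log_p.exp() * log_p).sum().item()
    return h / math.log(logits.numel())

@torch.no_grad()
def r_stitch_decode(slm_step, llm_step, prompt_ids, tau: float = 0.1,
                    eos_id: int = 2, max_new_tokens: int = 1024):
    """Entropy-guided hybrid decoding.

    slm_step / llm_step: callables that take the full token sequence and return
    next-token logits ([vocab_size]); each is assumed to manage its own KV cache
    and to prefill only the tokens produced since it last held control.
    """
    tokens = list(prompt_ids)
    active = "slm"  # start with the small model to minimize latency
    for _ in range(max_new_tokens):
        if active == "slm":
            logits = slm_step(tokens)
            if normalized_entropy(logits) > tau:
                # Uncertain step: delegate to the LLM, which overwrites the token.
                active = "llm"
                logits = llm_step(tokens)
        else:
            logits = llm_step(tokens)
            if normalized_entropy(logits) <= tau:
                # Low-entropy region reached: hand control back to the SLM
                # starting from the next step.
                active = "slm"
        next_token = int(logits.argmax())
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```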

Performance on Math Reasoning Benchmarks


Table 1. Main results on math reasoning benchmarks. R-Stitch achieves consistent speedups with minimal accuracy loss, especially at larger scales.

BibTeX

@article{chen2025r,
  title={R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning},
  author={Chen, Zhuokun and Chen, Zeren and He, Jiahao and Tan, Mingkui and Cai, Jianfei and Zhuang, Bohan},
  journal={arXiv preprint arXiv:2507.17307},
  year={2025}
}