Unleashing LLM Speed: Multi-Token Self-Speculative Decoding Redefines Inference – ██FR█████ █INTELL███████████

Figure S10: Decoding speeds and latencies with self-speculative decoding relative to standard autoregressive decoding. We use k heads of a 4-token prediction model and evaluate decoding speeds of a code model as explained in Table S2. All numbers are relative to the autoregressive (k = 1) baseline with the same batch size.

\ Table S3: Relative speedups with self-speculative decoding with byte-level models on code. We prompt the 7B parameter models from Section 3.3 on 4096 sequences of 1024 bytes of code not seen during training, and generate completions consisting of 1024 bytes using greedy self-speculative decoding (Stern et al., 2018) as in Table S2. The speedup was evaluated at a batch size of 16.

:::info Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech and Equal contribution;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclayand and Equal contribution;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and a last author;

(5) Gabriel Synnaeve, FAIR at Meta and a last author.

:::

:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

This content originally appeared on HackerNoon and was authored by Cosmological thinking: time, space and universal causation