Rapid-MLX, Hermes & MLX — Architecture Study Wikireference

A reference-grade walk-through of how an OpenAI-compatible local LLM server is actually built on Apple Silicon: from MLX kernels at the bottom, through mlx-lm's generation loop, into Rapid-MLX's serving harness, up to the Hermes-style tool-call protocol that lets a model say "please run this function for me."

How to use this page: each section is self-contained. Read top-to-bottom for the full story, or skip around via the left nav. The concept layer sits above the code-level appendix, which contains annotated excerpts from the actual sources. All claims link to a source in the Sources section.

Overview & scope

This wiki is a study companion for understanding how a modern, local, tool-capable LLM server actually fits together on Apple Silicon. It treats five components (MLX, mlx-lm, Rapid-MLX, the agent / orchestrator loop, and the Hermes tool-call format) as a single layered system, and walks each layer in enough detail that you could plausibly debug or extend any of them.

Who this is for

Engineers who can read Python and have a rough idea what a transformer does, but haven't traced an inference request end-to-end through a real serving stack.
Anyone trying to reason about why "tool calling" works at all — what the model actually sees, what the harness actually does, where the contract lives.
People deciding whether to use, fork, or replace any of these layers and want a map first.

In scope

Topic	Depth here
MLX as an array framework (lazy eval, unified memory, transforms)	Conceptual — enough to know why it's fast, not how to write Metal kernels.
mlx-lm's generation loop, samplers, logits processors, KV cache	API surface + the autoregressive loop in pseudocode.
Rapid-MLX's serving architecture, prompt cache, parsers, cloud routing	From the README architecture diagram down to flag-level behaviour.
The agent loop (server-side vs client-side responsibilities)	Canonical loop in pseudocode + annotated Hermes `recursive_loop`.
Hermes tool-call protocol (system prompt, `<tool_call>`, `<tool_response>`)	Wire format with examples, plus how it's rendered through the chat template.
How "the LLM knows what tools exist"	End-to-end render of a tool schema into the exact tokens the model sees.

Out of scope (deliberately)

Training, fine-tuning, LoRA mechanics — mentioned in passing, not explained.
Metal shader internals, MLX kernel authoring.
Benchmarking methodology — numbers are quoted from the Rapid-MLX README without re-running them.
Alternative tool-call formats (Llama, DeepSeek, Harmony, etc.) beyond a table; Hermes is the worked example.
Production concerns (auth, multi-tenancy, observability) — Rapid-MLX has flags for these; this page doesn't dwell on them.

How to read

Top-to-bottom is the intended path: the mental model and five-layer diagram are the spine, and every later section refers back to them. If you only have ten minutes, read Mental model + End-to-end trace — together they're a complete-enough picture to navigate the rest later. If you're here to verify a specific claim, jump straight to the appendix for source excerpts and Sources for the underlying repos.

Caveat on accuracy: this is a study reference assembled from public documentation. Where the code diverges from this description, the code is right. Versions move; the Rapid-MLX architecture diagram and parser list are accurate to the README at the time of writing, but the project is actively developed, so re-check before relying on any specific behaviour.

The mental model

Five components, each a thin abstraction over the one below it. The trick to understanding the whole stack is to see that each layer only knows about the one directly under it, and the protocol between an LLM and a tool is just a string contract enforced by careful prompt formatting and careful parsing.

┌────────────────────────────────────────────────────────────────────────┐
│  Client (Claude Code, Cursor, Aider, Open-WebUI, raw OpenAI SDK)       │
│  speaks → OpenAI /v1/chat/completions (HTTP+JSON)                      │
└──────────────────────────────────┬─────────────────────────────────────┘
                                   ▼
┌────────────────────────────────────────────────────────────────────────┐
│  Rapid-MLX server · FastAPI · OpenAI-compatible surface                │
│  ├── Cloud Router        (optional, via litellm)                       │
│  ├── SimpleEngine        prompt cache · KV trim · DeltaNet snapshots   │
│  ├── Tool parsers (×17)  hermes · llama · deepseek · harmony · …       │
│  └── Reasoning parsers   qwen3 · deepseek_r1 · minimax · harmony       │
└──────────────────────────────────┬─────────────────────────────────────┘
                                   ▼
┌────────────────────────────────────────────────────────────────────────┐
│  mlx-lm · load() · generate() · stream_generate()                      │
│  ├── tokenizer.apply_chat_template(messages, tools, …)                 │
│  ├── sampler (logits → token)                                          │
│  ├── logits_processors (token history + logits → logits')              │
│  └── KV cache (rotating, optional 4/8-bit quantized)                   │
└──────────────────────────────────┬─────────────────────────────────────┘
                                   ▼
┌────────────────────────────────────────────────────────────────────────┐
│  MLX · array framework for Apple Silicon                               │
│  ├── lazy evaluation (graph built, only materialized on mx.eval)       │
│  ├── unified memory (no host↔device copies, zero-copy GPU)             │
│  ├── composable transforms (grad · vmap · compile)                     │
│  └── Metal kernels (matmul, attention, quantized dequant)              │
└──────────────────────────────────┬─────────────────────────────────────┘
                                   ▼
      Apple Silicon: M1/M2/M3/M4 · CPU + GPU + ANE, one memory pool

The single most useful insight: everything above MLX is just string manipulation, JSON, and a loop. The model never literally "calls a tool". It emits text that the harness recognises as a tool call, executes externally, and feeds back as more text. The model's only "ability" is producing tokens that match a contract.

Five layers of the stack

1 · MLX

Arrays, autograd, Metal kernels. Apple's NumPy/JAX for unified memory.

2 · mlx-lm

LLM weights → tokens. Generation loop, KV cache, samplers, chat templates.

3 · Rapid-MLX

OpenAI-compatible HTTP server. Prompt cache, tool parsers, cloud routing.

4 · Agent loop

The recursion: model → tool call → execute → result → model → …

5 · Hermes format

The XML/JSON contract for tool advertisement and tool calls.

1 · Apple MLX — the foundation

MLX is an array framework for Apple silicon, built by Apple Machine Learning Research. Think of it as NumPy + autograd + Metal, but designed from day one for the Apple unified-memory architecture instead of being a CUDA framework retrofitted onto a Mac.

The six properties that matter

Property	What it means	Why an inference engine cares
Familiar APIs	Python API mirrors NumPy; `mlx.nn` mirrors PyTorch.	Almost zero porting cost from a PyTorch reference implementation.
Lazy computation	Operations build a graph; results materialise only when `mx.eval()` runs (or a value is read).	Lets MLX fuse kernels, eliminate intermediate allocations, and reorder ops.
Dynamic graphs	Graphs are constructed every call; shape changes don't recompile.	Variable-length sequences (the norm in LLM decoding) cost nothing extra.
Multi-device	Same array can run on CPU or GPU; no `.to(device)`.	Preprocessing on CPU and attention on GPU share the same buffer.
Unified memory	Arrays live in a single shared address space.	No host↔device copy of the KV cache, ever — this is the largest single win for decode-side perf.
Composable transforms	`grad`, `vmap`, `compile` compose like JAX's function transforms.	Same primitive supports training, fine-tuning (LoRA), and inference paths.
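The "lazy computation" row is easier to picture with a toy. The sketch below is a pure-Python illustration of the idea only: building an expression records a graph, and nothing is computed until `eval()` is called. The `Lazy` class and its `evaluations` counter are hypothetical names invented for this example; this is not how MLX is implemented, and `eval()` here merely plays the role that `mx.eval()` plays in MLX.

```python
class Lazy:
    """A node in a deferred computation graph (toy stand-in, not MLX)."""
    evaluations = 0  # how many nodes have actually been computed

    def __init__(self, fn, *parents):
        self.fn = fn
        self.parents = parents
        self._value = None
        self._done = False

    @classmethod
    def const(cls, value):
        # A leaf node: no parents, the function just returns the value.
        return cls(lambda: value)

    def __add__(self, other):
        return Lazy(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Lazy(lambda a, b: a * b, self, other)

    def eval(self):
        """Materialise this node and only the nodes it depends on."""
        if not self._done:
            args = [p.eval() for p in self.parents]
            self._value = self.fn(*args)
            self._done = True
            Lazy.evaluations += 1
        return self._value


a, b = Lazy.const(2), Lazy.const(3)
c = a * b + a          # builds a graph; nothing is computed yet
unused = a * a * a     # never evaluated, so never computed
before = Lazy.evaluations
result = c.eval()      # the analogue of calling mx.eval(c)
```

Because the whole expression is known before anything runs, a real framework has room to fuse or skip work; the toy only shows the "skip" half, since `unused` is never touched.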

Why unified memory is the whole game

On a discrete-GPU system, model weights and the KV cache have to fit in GPU VRAM while host RAM holds the request queue and tokenizer state; inputs, sampled tokens, and anything that spills out of VRAM cross PCIe. Apple Silicon has one physical memory pool addressable by both CPU and GPU, and MLX exposes that as the architectural primitive: there is no "move to GPU" call. The KV cache, which grows linearly with context length, never has to be transferred between devices. This is part of the case for an MLX-native engine over a general-purpose one such as llama.cpp's Metal backend, even when the latter is highly tuned.

What MLX is not: MLX is not a model zoo, not a serving stack, not a tokenizer. It is the array layer. Everything else (chat templates, KV caches, samplers, HTTP) is built on top by mlx-lm, mlx-vlm, and engines like Rapid-MLX.

2 · mlx-lm — the model runtime

mlx-lm is a Python package that turns "a folder of weights from Hugging Face" into "a Python function that produces tokens." It is the layer that owns: the tokenizer, the chat template, the autoregressive loop, the KV cache, and the sampler.

Three functions are the whole API

python
from mlx_lm import load, generate, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

messages = [{"role": "user", "content": "Write a story about Einstein"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# one-shot
text = generate(model, tokenizer, prompt=prompt, verbose=True)

# streaming
for response in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(response.text, end="", flush=True)

The autoregressive loop, in spirit

Every LLM inference engine is some variation of this loop. mlx-lm's implementation is a clean reference version of it.

python · pseudocode
def generate(model, tokenizer, prompt, max_tokens, sampler, logits_processors,
             prefill_step_size=2048):
    tokens = tokenizer.encode(prompt)
    kv_cache = make_cache(model)              # layer-wise KV buffers

    # PREFILL: run all but the last prompt token through the model in chunks
    # of --prefill-step-size; only the cache side effect matters here.
    for chunk in chunked(tokens[:-1], prefill_step_size):
        model(chunk, cache=kv_cache)

    # DECODE: one token at a time, until EOS or limit
    for _ in range(max_tokens):
        logits = model(tokens[-1:], cache=kv_cache)        # shape: [1, 1, vocab]
        for proc in logits_processors:
            logits = proc(tokens, logits)                  # e.g. tool logits bias
        next_token = sampler(logits)                       # temp / top-p / argmax
        if next_token == tokenizer.eos_token_id:
            break
        yield next_token
        tokens.append(next_token)

The pieces, named

tokenizer.apply_chat_template(messages, …): turns a list of role/content dicts (and optionally a tools=… argument) into the single prompt the model was trained to recognise, returned as token IDs by default or as text with tokenize=False. The template is a Jinja template shipped with the model's tokenizer config; add_generation_prompt=True appends the header that opens the assistant turn (for ChatML-style models, <|im_start|>assistant).
Sampler: any callable (logits) → token. Temperature, top-p, top-k, min-p, argmax. mlx_lm.sample_utils ships standard ones.
Logits processors: an ordered list of (history, logits) → logits. Repetition penalty lives here, and so does Rapid-MLX's tool logits bias, which nudges the model toward structured tokens like <tool_call> at the moment it should be opening one.
KV cache: per-layer key/value tensors that grow with sequence length. mlx-lm supports a rotating fixed-size cache (--max-kv-size) for long generations.
Prompt cache: serialise the KV state to disk (mlx_lm.cache_prompt) so a long system prompt only gets prefilled once. This is the seed Rapid-MLX builds its in-memory prompt cache on.

Why this matters for the rest of the wiki: every "smart" thing Rapid-MLX does (tool logits bias, prompt cache, streaming tool parsing) is implemented by inserting itself into one of four hooks: the chat template, the logits processors, the cache, or the post-decode token stream.
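To make the two callable shapes concrete, here is a minimal pure-Python sketch of a sampler `(logits) → token` and a logits processor `(history, logits) → logits`. It works on plain lists rather than MLX arrays, and the function names (`repetition_penalty`, `top_p_sampler`) are chosen for this example; they are not claimed to match the helpers in `mlx_lm.sample_utils`.

```python
import math
import random


def repetition_penalty(penalty):
    """Logits processor: (history, logits) -> logits. Dampens already-seen tokens."""
    def processor(history, logits):
        out = list(logits)
        for tok in set(history):
            # Common formulation: shrink positive logits, push negative ones lower.
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
        return out
    return processor


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_sampler(temperature=1.0, top_p=1.0, rng=random):
    """Sampler: logits -> token id. temperature == 0 means greedy argmax."""
    def sampler(logits):
        if temperature == 0:
            return max(range(len(logits)), key=logits.__getitem__)
        probs = softmax(logits, temperature)
        order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        kept, mass = [], 0.0
        for tok in order:              # smallest set whose mass reaches top_p
            kept.append(tok)
            mass += probs[tok]
            if mass >= top_p:
                break
        return rng.choices(kept, weights=[probs[t] for t in kept])[0]
    return sampler


logits = [2.0, 1.0, 0.5, -1.0]
greedy = top_p_sampler(temperature=0)
penalised = repetition_penalty(4.0)([0], logits)   # token 0 was already generated
```

With the penalty applied, the greedy choice moves from token 0 to token 1, which is the whole point of running processors before the sampler.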

3 · Rapid-MLX — the serving harness

Rapid-MLX (raullenchai/Rapid-MLX) is a fork of waybarrios/vllm-mlx that wraps mlx-lm and mlx-vlm in an OpenAI-compatible HTTP server, then aggressively layers performance and reliability tricks on top. The package directory is vllm_mlx/.

The architecture, from the README

        ┌──────────────────────────────────────┐
        │  OpenAI-compatible API (port 8000)   │
        │  /v1/chat/completions, /v1/models    │
        └──────────────────┬───────────────────┘
                           │
                  ┌────────┴────────┐
                  │  Cloud Router   │  (optional)
                  │  new_tokens >   │
                  │  threshold?     │
                  └───┬─────────┬───┘
                  yes │         │ no
         ┌────────────┘         └──────────────┐
         ▼                                     ▼
┌─────────────────┐                 ┌──────────────────────┐
│  Cloud LLM      │                 │  Local MLX Engine    │
│  (via litellm)  │                 │                      │
│  GPT-5, Claude, │                 │  ┌────────────────┐  │
│  Gemini, etc.   │                 │  │ SimpleEngine   │  │
└─────────────────┘                 │  │ + prompt cache │  │
                                    │  └───────┬────────┘  │
                                    │          │           │
                                    │  ┌───────┴────────┐  │
                                    │  │ mlx-lm/mlx-vlm │  │
                                    │  │ MLX + Metal    │  │
                                    │  └────────────────┘  │
                                    └──────────────────────┘

What Rapid-MLX owns that mlx-lm doesn't

Concern	Implementation
HTTP surface	FastAPI, OpenAI `/v1/chat/completions`, `/v1/models`, streaming SSE.
Persistent state	In-memory prompt cache keyed by message prefix; restored across requests.
Tool-call extraction	17 parsers, one per model family (hermes, llama, deepseek, harmony, kimi, glm47, minimax, …). Auto-selected from model name.
Reasoning extraction	Separate parsers for `<think>`-style chain-of-thought, surfaced as `reasoning_content` (never mixed into `content`).
Recovery	If a 4-bit quantized model emits a malformed tool call as plain text, the parser auto-converts it back to structured `tool_calls` JSON.
Routing	If `new_tokens > --cloud-threshold`, the request is shipped to a cloud LLM via litellm instead of running locally.
Streaming hygiene	Think-tag filter, chunk-boundary leak fix, developer role normalisation, disconnect guard.

SimpleEngine: the heart of the server

SimpleEngine is the boundary class. It accepts an OpenAI chat-completion request, decides whether to use the cache, runs the mlx-lm generation loop with the right logits processors and sampler, and emits a stream of tokens that the parser layer turns back into a structured response. Everything else — vision, audio, embeddings — sits beside SimpleEngine as a sibling and is dispatched by route.

How parsers get picked: parser selection is by model-name regex at startup. Qwen3.5-* → hermes + qwen3 reasoning. DeepSeek-R1 → deepseek + deepseek_r1 reasoning. GPT-OSS → harmony. An explicit --tool-call-parser always overrides. Hermes is the most widely compatible format, so Mistral, Devstral, Gemma, Phi-3/4 all use it.
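A rule table is one plausible way to express that selection. The sketch below is an assumption-laden illustration: the regular expressions, the `PARSER_RULES` table, the `select_parsers` name, and the Hermes fallback for unknown models are all invented here; only the family-to-parser mappings and the "explicit flag wins" rule come from the text above.

```python
import re

# Hypothetical rule table. Patterns are illustrative; mappings follow the text.
PARSER_RULES = [
    (re.compile(r"deepseek[-_]?r1", re.I), ("deepseek", "deepseek_r1")),
    # Reasoning parser for GPT-OSS is assumed to be the harmony one as well.
    (re.compile(r"gpt[-_]?oss", re.I), ("harmony", "harmony")),
    (re.compile(r"qwen3\.5", re.I), ("hermes", "qwen3")),
    (re.compile(r"mistral|devstral|gemma|phi-?[34]", re.I), ("hermes", None)),
]


def select_parsers(model_name, tool_call_parser=None):
    """Return (tool_parser, reasoning_parser) for a model name."""
    tool, reasoning = "hermes", None       # assumed fallback for unknown models
    for pattern, (t, r) in PARSER_RULES:
        if pattern.search(model_name):
            tool, reasoning = t, r
            break
    if tool_call_parser is not None:       # an explicit flag always wins
        tool = tool_call_parser
    return tool, reasoning
```

Ordering matters in a table like this: the DeepSeek-R1 rule sits first so that a name such as DeepSeek-R1-Distill-Qwen-7B is not claimed by a Qwen rule.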

Server flags that change behaviour at runtime

Flag	What it does	When you turn it on
`--enable-tool-logits-bias`	Logits processor that biases toward structured tokens (e.g. the `<tool_call>` opener) once the start of a tool call is detected.	Speed + reliability of tool-emitting models.
`--prefill-step-size`	Tokens processed per prefill chunk (default 2048).	Larger = faster cold start, more peak memory.
`--kv-bits 4|8`	Quantize the KV cache.	Long contexts on small memory budgets.
`--draft-model`	Speculative decoding draft model.	Up to 2× decode boost on compatible model pairs.
`--cloud-model` + `--cloud-threshold`	Spill long-context requests to a cloud LLM.	You want low latency on small chats and large-context fall-through.
`--mcp-config`	Wire in an external Model Context Protocol tool catalog.	Letting the server itself surface tools to clients.
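The cloud-routing rule (`new_tokens > --cloud-threshold`) fits in a few lines. The sketch below rests on one loud assumption: that "new tokens" means prompt tokens not already covered by the prompt cache. The `route` function and `RouteDecision` type are hypothetical names for this example, not Rapid-MLX's API.

```python
from dataclasses import dataclass


@dataclass
class RouteDecision:
    target: str        # "local" or "cloud"
    new_tokens: int


def route(prompt_tokens, cached_prefix_tokens, cloud_threshold=None):
    """Decide where a request runs.

    Assumption: new_tokens counts prompt tokens that would still need
    local prefill, i.e. those not covered by a cached prefix.
    """
    new_tokens = max(prompt_tokens - cached_prefix_tokens, 0)
    if cloud_threshold is not None and new_tokens > cloud_threshold:
        return RouteDecision("cloud", new_tokens)
    return RouteDecision("local", new_tokens)
```

Under that reading, a long conversation whose prefix is already cached stays local even when its total length exceeds the threshold, because only the uncached suffix counts.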

4 · The agent loop (orchestrator)

"Agent" and "orchestrator" are overloaded words. In this stack they have two distinct meanings depending on which side of the API you stand on. Untangling them is half the battle.

	Server-side loop (Rapid-MLX)	Client-side loop (Claude Code, Aider, …)
Owns	Token sampling, parser, streaming, prompt cache.	Tool schemas, tool execution, multi-turn planning, user UI.
Inputs	OpenAI chat request (messages + tools).	User prompt + filesystem + git + shell.
Outputs	Structured `tool_calls` or final `content`.	Edits, diffs, runs, follow-up messages.
Loop trigger	Each HTTP call is one model turn.	If response contains tool_calls → execute → re-call server.

At the API level the Rapid-MLX server is stateless per turn. It receives the whole transcript every time, runs the model once, returns either content or tool_calls, and keeps no conversation state; the in-memory prompt cache only speeds up prefill for a prefix it has already seen and does not change the result. The orchestrator is whichever client is driving: Claude Code, Cursor, Aider, OpenCode, or your own script. This is why "drop-in OpenAI replacement" works: the client already knows how to run the agent loop against any OpenAI-compatible endpoint.

The canonical agent loop, on either side

python · pseudocode
import json

def agent_loop(user_query, tools, max_depth=5):
    messages = [
        {"role": "system", "content": system_prompt_with_tools(tools)},
        {"role": "user",   "content": user_query},
    ]
    for step in range(max_depth):
        resp = openai_chat_completion(messages=messages, tools=tools)
        msg  = resp.choices[0].message
        messages.append(msg)                                          # assistant turn

        if not msg.tool_calls:
            return msg.content                                       # done

        for call in msg.tool_calls:
            args   = json.loads(call.function.arguments)              # arguments arrive as a JSON string
            result = dispatch(call.function.name, args)               # execute
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
    raise MaxDepthExceeded()

Where the recursion lives: in NousResearch/Hermes-Function-Calling the loop is named recursive_loop and lives inside generate_function_call. Each level either (a) finds tool calls, executes them, appends a <tool_response> turn, and recurses, or (b) decides the model is done. Max depth defaults to 5. The annotated source is in the appendix.
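The loop can be exercised without a model at all. The following is a minimal sketch with a stubbed server: `fake_chat_completion` and the `TOOLS` registry are hypothetical stand-ins invented for this example, and messages are plain dicts rather than SDK objects. It is meant to show the control flow (call, execute, append, re-call, depth limit), not any particular client's implementation.

```python
import json


def fake_chat_completion(messages, tools):
    """Stand-in for the server: asks for one tool call, then answers."""
    if messages[-1]["role"] == "tool":
        return {"role": "assistant",
                "content": "It is " + messages[-1]["content"],
                "tool_calls": None}
    return {"role": "assistant", "content": None, "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "get_time", "arguments": json.dumps({"tz": "UTC"})},
    }]}


TOOLS = {"get_time": lambda tz: f"12:00 {tz}"}   # hypothetical tool registry


def agent_loop(user_query, chat=fake_chat_completion, max_depth=5):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_depth):
        msg = chat(messages, tools=list(TOOLS))
        messages.append(msg)
        if not msg["tool_calls"]:
            return msg["content"], messages          # model answered in plain text
        for call in msg["tool_calls"]:
            fn = call["function"]
            args = json.loads(fn["arguments"])       # arguments arrive as a JSON string
            result = TOOLS[fn["name"]](**args)
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": str(result)})
    raise RuntimeError("max depth exceeded")


answer, transcript = agent_loop("What time is it?")
```

The transcript ends up as user, assistant (tool call), tool, assistant (answer), which is the same shape a real client sends back to the server on the second request.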

5 · The Hermes tool-call format

"Hermes format" is the protocol developed by Nous Research for their Hermes-2-Pro / Hermes-3 models. It is a conventional protocol — there's no magic, just a system prompt and two XML tags that the model is trained to respect. Rapid-MLX uses this format as its default for Qwen, Mistral, Devstral, Gemma, and Phi-3/4 because they all tolerate it well.

The three pieces of the contract

1. The system prompt

Tells the model that it is a function-calling agent and lists every available tool. Tools are serialised as JSON Schema-flavoured signatures inside a <tools> tag.

text · system message
You are a function calling AI model. You are provided with function
signatures within <tools></tools> XML tags. You may call one or more
functions to assist with the user query. Don't make assumptions about
what values to plug into functions.

<tools>
{"type": "function", "function": {
   "name": "get_stock_price",
   "description": "Get the current stock price for a ticker symbol",
   "parameters": {
     "type": "object",
     "properties": {"symbol": {"type": "string"}},
     "required": ["symbol"]
   }
}}
</tools>

For each function call return a json object with function name and
arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>{"name": "<function-name>", "arguments": <args-dict>}</tool_call>

2. The assistant's tool call

text · assistant message
I'll look that up for you.
<tool_call>
{"name": "get_stock_price", "arguments": {"symbol": "TSLA"}}
</tool_call>

3. The tool's response, fed back as the next turn

text · tool message
<tool_response>
{"name": "get_stock_price", "content": {"symbol": "TSLA", "price": 312.04}}
</tool_response>
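The first and third pieces of the contract are pure string rendering, which a short sketch can show. The helper names `render_system_prompt` and `render_tool_response` are hypothetical; the prompt wording is copied from the system message above, and in practice the model's chat template usually does this rendering rather than hand-written code.

```python
import json


def render_system_prompt(tools):
    """Serialise OpenAI-style tool schemas into a Hermes-style system prompt."""
    tool_lines = "\n".join(json.dumps(t) for t in tools)
    return (
        "You are a function calling AI model. You are provided with function "
        "signatures within <tools></tools> XML tags. You may call one or more "
        "functions to assist with the user query. Don't make assumptions about "
        "what values to plug into functions.\n\n"
        f"<tools>\n{tool_lines}\n</tools>\n\n"
        "For each function call return a json object with function name and "
        "arguments within <tool_call></tool_call> XML tags as follows:\n"
        '<tool_call>{"name": "<function-name>", "arguments": <args-dict>}</tool_call>'
    )


def render_tool_response(name, content):
    """Wrap a tool result the way the model expects to read it back."""
    payload = json.dumps({"name": name, "content": content})
    return "<tool_response>\n" + payload + "\n</tool_response>"
```

Calling `render_tool_response("get_stock_price", {"symbol": "TSLA", "price": 312.04})` reproduces the tool message shown above.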

Why XML tags around JSON?

Two reasons. First: greppability. <tool_call>…</tool_call> is trivially findable by streaming parsers even mid-token. Second: state machine clarity. The model sees a clear "I am now in tool-call mode" boundary, which empirically helps small/quantized models stay structured. JSON inside gives the args their type discipline.

Quantization is where this breaks: 4-bit quantized models routinely emit tool_call JSON without the surrounding tags, or with keys subtly wrong (for example parameters where arguments is expected). Rapid-MLX's "auto tool recovery" pass catches these by pattern-matching the JSON-shaped chunk in the model's plain-text content and reconstructing the structured tool-call envelope before returning to the client. Per the README, this is what gets quantized Qwen3.5 to 100% tool-call success.

Multiple calls per turn, parallel tools

The model may emit several <tool_call> blocks in a single assistant turn. The orchestrator should execute all of them and return one <tool_response> block per call in the following tool turn, before the model is called again. Modern Hermes-trained models handle this natively.
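Handling several calls in one turn is mostly a matter of finding every tagged block and answering each in order. The sketch below assumes well-formed tags and valid JSON (no recovery), and the names `extract_tool_calls`, `run_all`, and the `registry` mapping are hypothetical helpers for this example.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(assistant_text):
    """Return (plain_content, [call dicts]) from one assistant turn."""
    calls = [json.loads(body) for body in TOOL_CALL_RE.findall(assistant_text)]
    content = TOOL_CALL_RE.sub("", assistant_text).strip()
    return content, calls


def run_all(calls, registry):
    """Execute every call; return one <tool_response> block per call, in order."""
    blocks = []
    for call in calls:
        result = registry[call["name"]](**call["arguments"])
        payload = json.dumps({"name": call["name"], "content": result})
        blocks.append(f"<tool_response>\n{payload}\n</tool_response>")
    return "\n".join(blocks)
```

Keeping responses in the same order as the calls is a design choice of this sketch; it lets the model pair them up by position as well as by name.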

How the LLM knows what it can call

This section is the crux of the wiki. There is no magic introspection: the model sees only the tokens you give it. So "what the harness offers" comes down to three things:

Schema rendering — the orchestrator (or the chat template) renders each tool's JSON Schema into the system prompt, inside <tools>…</tools>.
Template binding — the chat template (Jinja, shipped with the model) decides exactly how the tools block is interleaved with the system instructions and user messages. Most Hermes-trained chat templates accept a tools=… kwarg to apply_chat_template.
Training — the model has been fine-tuned on conversations that follow this exact format, so it has learned to (a) emit a <tool_call> block when calling a tool, (b) wait for a <tool_response>, (c) emit normal content when answering.

Walk-through: a single tool, end-to-end

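python · client
from openai import OpenAI

# Point the stock OpenAI SDK at the local server (port 8000, as in the diagram).
# The api_key is a placeholder, assumed to be ignored by a local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# 1) Client describes the tool in OpenAI form
tools = [{
  "type": "function",
  "function": {
     "name": "read_file",
     "description": "Read a file from disk.",
     "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
     },
  },
}]
resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's in README.md?"}],
    tools=tools,
)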

text · what the model actually sees (after chat template)
<|im_start|>system
You are a function calling AI model. ...
<tools>
{"type":"function","function":{"name":"read_file","description":"Read a file from disk.","parameters":{"type":"object","properties":{"path":{"type":"string"}},"required":["path"]}}}
</tools>
For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags ...
<|im_end|>
<|im_start|>user
What's in README.md?<|im_end|>
<|im_start|>assistant

text · what the model emits
<tool_call>
{"name": "read_file", "arguments": {"path": "README.md"}}
</tool_call><|im_end|>

json · what Rapid-MLX returns to the client
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "read_file",
          "arguments": "{\"path\": \"README.md\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

The harness has to do three things right: (1) Render the tools into the prompt so the model knows they exist. (2) Detect <tool_call> in the stream and switch parser state. (3) Reshape the model's output into OpenAI's tool_calls JSON before returning. Rapid-MLX does all three; the model just produces tokens.
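Step (3), the reshaping, is worth seeing in isolation because of one detail that trips people up: in the OpenAI response shape, `arguments` is a JSON string, not an object. The sketch below is illustrative only; `to_openai_message` and the `call_` plus random hex id scheme are assumptions of this example, not Rapid-MLX's code.

```python
import json
import uuid


def to_openai_message(content, calls):
    """Build the assistant message and finish_reason for one completion choice."""
    if not calls:
        return {"message": {"role": "assistant", "content": content},
                "finish_reason": "stop"}
    tool_calls = [{
        "id": "call_" + uuid.uuid4().hex[:12],    # id scheme is an assumption
        "type": "function",
        "function": {
            "name": c["name"],
            # OpenAI clients expect a JSON *string* here, not a nested object.
            "arguments": json.dumps(c["arguments"]),
        },
    } for c in calls]
    return {
        "message": {"role": "assistant",
                    "content": content or None,
                    "tool_calls": tool_calls},
        "finish_reason": "tool_calls",
    }
```

The client later echoes each `id` back as `tool_call_id` on its tool message, so the ids only need to be unique within a response.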

End-to-end trace

One full round of "user asks → tool gets called → user gets answer," with every layer's responsibility labelled.

User: "Read README.md and summarise it"
   │
   ▼
Claude Code (client orchestrator)
   • Builds messages[] with system + user
   • Attaches tools=[read_file, write_file, run_shell, ...]
   • POSTs /v1/chat/completions to localhost:8000
   │
   ▼
Rapid-MLX · FastAPI route
   • Cloud router: new_tokens < threshold → stay local
   • SimpleEngine.handle(request)
   │
   ▼
SimpleEngine
   • Looks up prompt cache by messages[] prefix
   • Hit: restore KV cache (transformer) or DeltaNet snapshot
   • Builds final token sequence via
     tokenizer.apply_chat_template(
         messages, tools=tools, add_generation_prompt=True)
   │
   ▼
mlx-lm · stream_generate
   • PREFILL only the new suffix (cache trick)
   • DECODE token by token through MLX kernels
   • Each step: sampler(logits_processors(history, logits))
   │
   ▼
MLX · Metal
   • Lazy graph: matmul, attention, RMSNorm fused into Metal shaders
   • Unified memory: KV cache appended in place, no copy
   │
   ▼
Token stream:
   "<tool_call>{"name":"read_file","arguments":{"path":"README.md"}}</tool_call>"
   │
   ▼
Hermes tool parser
   • Detects <tool_call> opener mid-stream
   • Buffers JSON until </tool_call>
   • Validates name + args against the provided tools schema
   • Auto-recovers if JSON is malformed (e.g. quantization noise)
   │
   ▼
HTTP response:
   {"message": {"tool_calls": [{"function": {...}}]}, "finish_reason": "tool_calls"}
   │
   ▼
Claude Code (orchestrator)
   • Sees finish_reason="tool_calls"
   • Looks up read_file → opens "./README.md" → reads bytes
   • Appends {"role":"tool", "tool_call_id":"...", "content":"<file contents>"} to messages
   • POSTs /v1/chat/completions again: same prefix, longer suffix
   │
   ▼
SimpleEngine: prompt cache hits the long prefix, prefills only the tool turn
   │
   ▼
mlx-lm: model emits plain content this time
   │
   ▼
User sees: "The README documents a local OpenAI-compatible LLM server …"

Parsers & recovery in detail

Tool-call parsers are the most subtle part of the server. They run as a streaming state machine over the decoded tokens, and they're the only thing between "model emitted text" and "client receives structured JSON." There are 17 of them in Rapid-MLX, one per major model family.

Parser	Native format	Models
`hermes`	`<tool_call>{json}</tool_call>`	Qwen3.5, Mistral, Devstral, Gemma, Phi-3/4, Hermes-3
`llama`	JSON only, often `{"name": ..., "parameters": ...}`	Llama 3.x
`deepseek` / `deepseek_v31`	Family-specific JSON wrappers	DeepSeek V2.5, V3, V3.1, R1
`harmony`	OpenAI's open-weight Harmony channel format	GPT-OSS
`minimax`	XML-flavoured tool format	MiniMax-M2.5
`glm47`	GLM-family tool format	GLM-4.7
`kimi`	Kimi-Linear tool format	Kimi-Linear

The state machine

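python · pseudocode (hermes)
class HermesParser:
    OPEN  = "<tool_call>"
    CLOSE = "</tool_call>"

    def __init__(self, tools_schema):
        self.state  = "content"     # content | in_call
        self.buf    = ""            # decoded text not yet emitted or parsed
        self.calls  = []
        self.schema = tools_schema

    def feed(self, token_text):
        self.buf += token_text
        events = []
        while True:
            if self.state == "content":
                head, tag, rest = self.buf.partition(self.OPEN)
                if not tag:
                    # No complete opener yet. Hold back a tail such as "<tool_c"
                    # in case the tag is split across tokens; emit the rest.
                    hold = max((k for k in range(1, len(self.OPEN))
                                if head.endswith(self.OPEN[:k])), default=0)
                    emit, self.buf = head[:len(head) - hold], head[len(head) - hold:]
                    if emit:
                        events.append(stream_event("content", emit))
                    return events
                if head:
                    events.append(stream_event("content", head))
                self.state, self.buf = "in_call", rest
            else:                          # in_call: buffer until the closer arrives
                raw, tag, rest = self.buf.partition(self.CLOSE)
                if not tag:
                    return events
                call = recover_json(raw)        # tolerant parse
                if validate(call, self.schema):
                    self.calls.append(call)
                self.state, self.buf = "content", rest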

Auto-recovery — what "100% tool calling" actually means

The recovery pass runs after generation if the model produced something that smells like a tool call but didn't conform. Patterns it handles:

Missing opening tag: {"name":"x","arguments":{...}} emitted as plain content → wrap with <tool_call>.
Markdown-fenced JSON: ```json\n{...}\n``` with no XML at all → extract and structure.
Truncated close tag: <tool_call>{json}</tool (EOS hit early) → close synthetically if JSON is valid.
parameters vs arguments key drift → normalise to arguments for OpenAI compat.

This is why a 4-bit quantized Qwen3.5 model can hit 100% tool-call success in Rapid-MLX's evals — the model occasionally fumbles the formatting, the parser silently fixes it, and the client never sees the mess.
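The four patterns listed above can be sketched as a single best-effort function. This is an illustration of the idea, not Rapid-MLX's recovery code: `recover_tool_call` and its regular expressions are invented here, it handles only one call per text, and it deliberately returns `None` rather than guessing when the JSON does not parse.

```python
import json
import re

TICKS = "`" * 3    # a Markdown code fence, built up to keep this listing readable
FENCED = re.compile(TICKS + r"(?:json)?\s*(\{.*\})\s*" + TICKS, re.DOTALL)
TAGGED = re.compile(r"<tool_call>\s*(\{.*\})", re.DOTALL)   # close tag may be cut off


def recover_tool_call(text):
    """Best effort: return {"name", "arguments"} or None if nothing call-shaped."""
    candidate = None
    for pattern in (TAGGED, FENCED):
        m = pattern.search(text)
        if m:
            candidate = m.group(1)
            break
    if candidate is None:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return None
        candidate = text[start:end + 1]       # bare JSON emitted as plain content
    try:
        obj = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or "name" not in obj:
        return None
    # Key drift: some models say "parameters" where OpenAI expects "arguments".
    args = obj.get("arguments", obj.get("parameters", {}))
    return {"name": obj["name"], "arguments": args}
```

A production recovery pass would also want to check the recovered name against the advertised tools, so that ordinary JSON in a chat answer is not mistaken for a call.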

Performance techniques

The README lists nine optimisation techniques. Three of them are conceptually interesting enough to study; the rest are configuration knobs.

Prompt cache · KV trim

For a standard transformer, the KV cache at position n only depends on tokens 0..n. So if turn 2 starts with the same 10,000 tokens of system+history as turn 1, you can reuse the KV cache from turn 1 and only prefill the new suffix. Rapid-MLX hashes the message prefix and trims its in-memory cache to the longest common prefix. The README claims 2–5× faster time to first token (TTFT).
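The "longest common prefix" step can be sketched directly on token IDs. The function names below are hypothetical, and the sketch compares tokens one by one for clarity where the text above describes hashing the message prefix. It also keeps at least one token for prefill, since the model must process something to produce the next logits.

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens the two sequences share."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def plan_prefill(cached_tokens, new_tokens):
    """Return (reuse, suffix): cached positions to keep and tokens still to prefill."""
    reuse = common_prefix_len(cached_tokens, new_tokens)
    # Leave at least one token to run through the model so it yields logits.
    reuse = max(min(reuse, len(new_tokens) - 1), 0)
    return reuse, new_tokens[reuse:]
```

The KV cache would then be trimmed to `reuse` positions and only `suffix` fed through prefill, which is where the TTFT saving comes from.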

DeltaNet state snapshots

Qwen3.5 uses Gated DeltaNet (an RNN-style layer) for 75% of its layers and full attention for the other 25%. RNN state isn't "trimmable" the way KV is — you can't slice off the last k rows because each step depends on all prior steps. Rapid-MLX's trick: deep-copy the RNN state at the system-prompt boundary the first time you see it, and on subsequent requests, restore the snapshot in ~0.1 ms instead of re-running hundreds of tokens through the recurrent path. README reports 1.5–4.8× TTFT speedup on Qwen3.5 variants — it's the first prompt-cache implementation for hybrid RNN architectures on MLX.
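A toy recurrent layer shows why a snapshot works where trimming cannot. Everything in the sketch below is a stand-in: `ToyRecurrentLayer` has a two-number state rather than real DeltaNet state, and `SnapshotCache` is a hypothetical name. The point it illustrates is only the mechanism described above: copy the state at the system-prompt boundary once, then restore it instead of replaying the prefix.

```python
import copy


class ToyRecurrentLayer:
    """Stand-in for an RNN-style layer: state depends on every token so far."""
    def __init__(self):
        self.state = [0.0, 0.0]

    def step(self, token):
        self.state = [0.5 * self.state[0] + token, 0.9 * self.state[1] + 1.0]


class SnapshotCache:
    def __init__(self):
        self._snapshots = {}                 # key: tuple of system-prompt tokens

    def run(self, system_tokens, rest_tokens):
        """Return (final_state, steps_executed) for system + rest tokens."""
        key = tuple(system_tokens)
        executed = 0
        if key in self._snapshots:
            layer = copy.deepcopy(self._snapshots[key])    # restore, skip the prefix
        else:
            layer = ToyRecurrentLayer()
            for t in system_tokens:
                layer.step(t)
                executed += 1
            self._snapshots[key] = copy.deepcopy(layer)    # snapshot at the boundary
        for t in rest_tokens:
            layer.step(t)
            executed += 1
        return layer.state, executed


cache = SnapshotCache()
state_cold, steps_cold = cache.run([1, 2, 3], [4])
state_warm, steps_warm = cache.run([1, 2, 3], [4])
```

The deep copy on restore matters: without it, the second request would mutate the stored snapshot and corrupt every later request that shares the prefix.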

Tool logits bias (jump-forward decoding)

Once the parser detects the model is starting a <tool_call> sequence, it knows the next several tokens must be the opening JSON structure. A logits processor biases those tokens upward or, in the limit, force-decodes them, skipping sampling for tokens whose value is already determined by the format. This is both a speedup and a reliability win, because the fixed part of the structure cannot go wrong.
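As a logits processor, the bias has the same `(history, logits) → logits` shape described earlier. The sketch below is a simplified illustration on plain lists: `make_tool_bias` is a hypothetical name, `opener_ids` stands for whatever token ids the tokenizer assigns to the fixed opening sequence (the ids in any demo are made up), and the trigger is simply "history ends with a prefix of the opener".

```python
def make_tool_bias(opener_ids, bias=5.0, force=False):
    """Logits processor that steers generation through a fixed token sequence.

    opener_ids: list of token ids for the structured prefix (assumed known).
    Activates once the history ends with at least the first opener token.
    """
    def processor(history, logits):
        # How many opener tokens are already at the end of the history?
        matched = 0
        for k in range(min(len(opener_ids) - 1, len(history)), 0, -1):
            if history[-k:] == opener_ids[:k]:
                matched = k
                break
        if matched == 0:
            return logits                      # not inside the opener: no change
        target = opener_ids[matched]           # the token the format requires next
        if force:
            # Limit case: every other token becomes impossible.
            return [0.0 if i == target else float("-inf")
                    for i in range(len(logits))]
        out = list(logits)
        out[target] += bias
        return out
    return processor
```

With `force=True` the sampler has only one possible outcome, which is what makes it safe for an engine to skip sampling for those positions altogether.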

The composition pattern: all three are implemented by reaching into one of mlx-lm's hooks. Prompt cache wraps the KV cache, DeltaNet snapshots add a parallel cache for RNN state, and tool logits bias is a logits processor. The agent loop doesn't need to know; it just sees a fast OpenAI server.

How open models learn — distillation

A short detour. The rest of the wiki is about running models; this section is about how the specific models Rapid-MLX serves (DeepSeek-R1 distilled variants, Hermes-3-Llama, Qwen3.5, GPT-OSS) got to be small, fast, and good. Distillation is one of the main reasons a 7B model on your laptop can hold its own against a 70B model from last year.

Core insight: distillation is not a separate model architecture. It's a training recipe: take a small student, supervise it on the outputs of a much larger teacher and, importantly, on the teacher's reasoning traces. The student inherits much of the teacher's behaviour without inheriting its parameter count.

Three flavours, in order of "openness"

Flavour	What the student sees from the teacher	Requires
White-box (logit) distillation	Full output distribution per token, often via KL-divergence loss against a temperature-softened teacher.	Teacher weights or at least logits exposed.
Feature distillation	Hidden-state matching: align student layer activations to teacher layer activations.	Teacher weights and architectural compatibility.
Black-box (response) distillation	Only the teacher's sampled outputs — text completions, sometimes with reasoning chains.	Only an API. Works against closed models.
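The white-box row mentions a KL-divergence loss against a temperature-softened teacher. For a single token position this can be written out in a few lines of plain Python. The sketch assumes the common formulation KL(teacher ‖ student) with a T² rescaling; real training code would compute this over batches of tensors, and `kd_loss` is a name chosen for this example.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, one token."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi))
             for pi, qi in zip(p, q) if pi > 0)
    # T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2
```

A higher temperature flattens both distributions, so the student is also supervised on how the teacher ranks the unlikely tokens, which is information a sampled text completion cannot carry.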

For open models in 2026 the dominant flavour is black-box distillation on synthetic data, usually augmented with the teacher's chain-of-thought traces. It's cheap (one inference pass per training sample), it works against any teacher you can prompt, and the resulting supervision is high-signal because the student is essentially learning to imitate a strong policy.

The recipe in pseudocode

python · pseudocode
# 1) Curate diverse prompts that cover the capability you want.
prompts = load_prompts(domains=["math", "code", "tool-use", "chat", ...])

# 2) Run the TEACHER over every prompt. Keep reasoning + final answer.
teacher = load("big-frontier-model")
samples = []
for p in prompts:
    out = teacher.generate(p, enable_thinking=True)
    if verify(out):                            # reject unsupported answers
        samples.append({"prompt": p,
                        "reasoning": out.cot,
                        "answer": out.final})

# 3) Supervised fine-tuning on the STUDENT.
student = load("qwen2.5-7b")                    # or llama-3.1-8b, etc.
sft_train(student, samples, loss="crossentropy")

# 4) (Optional) add KL term against teacher logits if available.
# 5) Optionally: RLHF / DPO on top. DeepSeek-R1 distillation skipped this.

Three details matter more than they sound:

Verification before training. Synthetic data is only useful if it's right. Recent distillation pipelines typically add some form of rejection sampling: run the teacher several times per prompt and keep only outputs that pass a verifier (a checker for math, a compiler or test suite for code, a function-call schema check for tool use).
Teach the reasoning, not just the answer. Including the teacher's chain-of-thought in the training data, not just the final answer, is what transfers the capability rather than the answer to one prompt. This is the DeepSeek-R1 finding in one sentence.
Mix the data. Pure distillation on one capability erodes the others. Real recipes mix general-purpose chat data with the capability you're targeting.
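
The first and third details above can be sketched in a few lines. This is a toy under stated assumptions: `fake_teacher` is a stand-in that answers "a+b" prompts and is deliberately wrong about a third of the time, `verify` is an exact arithmetic checker, and `mix` blends in general-purpose data at a fixed ratio. All names are hypothetical; a real pipeline would replace them with a model call, a domain verifier, and a curated chat corpus.

```python
import random

def fake_teacher(prompt, rng):
    """Stand-in for a teacher model: answers 'a+b' prompts, sometimes incorrectly."""
    a, b = (int(x) for x in prompt.split("+"))
    answer = a + b if rng.random() > 0.33 else a + b + 1
    return {"reasoning": f"{a} plus {b} is {answer}", "answer": str(answer)}

def verify(prompt, sample):
    """Ground-truth checker for the toy arithmetic domain."""
    a, b = (int(x) for x in prompt.split("+"))
    return sample["answer"] == str(a + b)

def rejection_sample(prompts, teacher, tries=4, seed=0):
    """Query the teacher up to `tries` times per prompt; keep the first verified output."""
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        for _ in range(tries):
            sample = teacher(prompt, rng)
            if verify(prompt, sample):
                kept.append({"prompt": prompt, **sample})
                break
    return kept

def mix(target, general, general_fraction=0.5, seed=0):
    """Blend target-capability samples with general data (general_fraction must be < 1)."""
    rng = random.Random(seed)
    n_general = int(len(target) * general_fraction / (1.0 - general_fraction))
    blended = target + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(blended)
    return blended

kept = rejection_sample(["1+2", "10+5", "7+7"], fake_teacher)
print(all(verify(s["prompt"], s) for s in kept))   # every kept sample passed the verifier
```

Prompts for which no attempt verifies are simply dropped, which is why the output set can be smaller than the prompt set.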

Case study · DeepSeek-R1 distillation

The most influential recent example, and a clean demonstration of why distillation works. DeepSeek took its large RL-trained R1 reasoning model as the teacher, started from six open-source students (Llama-3.1-8B, Llama-3.3-70B-Instruct, and Qwen2.5 at 1.5B / 7B / 14B / 32B), generated ~800,000 high-quality samples with R1, most of them reasoning traces, and supervised-fine-tuned the students on that data. No RL on the students. The released family, DeepSeek-R1-Distill-Qwen-{1.5B, 7B, 14B, 32B} and DeepSeek-R1-Distill-Llama-{8B, 70B}, is what Rapid-MLX, Ollama, and llama.cpp actually run when someone says "I'm using R1 locally."

The headline finding: In the R1 technical report, the team explicitly compared "distill the small model from the big one" against "run the same RL recipe directly on a small model" and found the distilled version wins decisively. The intuition: RL needs the model to already be able to produce occasional good outputs to reward; small models often can't, but they can imitate.

Case study · Hermes 2 Pro / Hermes 3

Nous Research's Hermes line is the worked example of capability-targeted distillation. Hermes 2 Pro (built on Mistral 7B and Llama 3) and Hermes 3 (built on Llama 3.1 at 8B / 70B / 405B) are trained primarily on synthetically generated responses. The function-calling capability covered earlier in this wiki, the <tool_call> / <tool_response> protocol, was instilled by the openly released hermes-function-calling-v1 dataset: a mix of single-turn and multi-turn function-calling conversations, JSON-mode samples, agentic JSON-mode, and structured extraction. The Hermes 2 Pro model card reports 90% on a function-calling eval built with Fireworks.AI and 84% on a structured JSON output eval.

The lesson: the tool-call protocol described in this wiki only works because the model was trained on a dataset that uses it. The XML tags, the JSON shape, the multi-turn convention — none of it would be reliable if the model hadn't seen thousands of correctly-formatted examples during fine-tuning. Hermes is the canonical "how you teach a model to call tools" recipe, and it's why so many other model families (Mistral, Devstral, Gemma, Phi-3/4) work with the same parser in Rapid-MLX.
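
To make "the XML tags, the JSON shape" tangible, the sketch below extracts `<tool_call>` blocks from one complete assistant message and checks each against an OpenAI-style tool list. The helper names and the `get_weather` tool are hypothetical, and the check covers only the tool name and required arguments, not full JSON Schema validation.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def parse_tool_calls(completion):
    """Return (calls, errors) extracted from one assistant completion."""
    calls, errors = [], []
    for raw in TOOL_CALL_RE.findall(completion):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"invalid JSON: {exc.msg}")
            continue
        if not (isinstance(call, dict) and isinstance(call.get("name"), str)
                and isinstance(call.get("arguments"), dict)):
            errors.append("tool call must be an object with 'name' and 'arguments'")
            continue
        calls.append(call)
    return calls, errors

def check_against_tools(call, tools):
    """Check that the call names a known tool and supplies its required arguments."""
    spec = next((t["function"] for t in tools if t["function"]["name"] == call["name"]), None)
    if spec is None:
        return False, f"unknown tool {call['name']!r}"
    required = spec.get("parameters", {}).get("required", [])
    missing = [key for key in required if key not in call["arguments"]]
    return (not missing), (f"missing arguments: {missing}" if missing else "ok")

tools = [{"type": "function", "function": {
    "name": "get_weather",  # hypothetical tool for illustration
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]}}}]
completion = ('Let me check.\n<tool_call>\n'
              '{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>')
calls, errors = parse_tool_calls(completion)
print(calls, errors)
print(check_against_tools(calls[0], tools))
```

A model that was never fine-tuned on this format tends to fail at exactly these checks (unclosed tags, malformed JSON, missing arguments), which is the practical meaning of "the protocol only works because the model was trained on it."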

Brief mentions · Qwen and GPT-OSS

Qwen / Qwen3.5 — Alibaba's series uses a heavy synthetic-data + multi-stage post-training pipeline. The "thinking" variants (Qwen3.5-A3B and similar) emit reasoning blocks before answers; the chat template handles enable_thinking as a flag. Distillation from larger Qwen teachers to smaller Qwen students is part of how the small variants stay competitive.
GPT-OSS — OpenAI's open-weight release uses the Harmony channel format and is itself a distilled student of a larger internal teacher. The Harmony parser in Rapid-MLX is the runtime counterpart to that training format.

Why this matters for Rapid-MLX (the link back)

Almost every model Rapid-MLX serves on consumer hardware is the product of a distillation pipeline followed by quantization. Two specific connections worth holding in mind:

Compounding compression. A 4-bit DeepSeek-R1-Distill-Qwen-7B running in Rapid-MLX has been compressed twice: from 671B → 7B (distillation) and then from FP16 → INT4 (quantization). Most of the user-visible quality loss is from the first step; quantization is comparatively cheap, which is what makes "fit a frontier-quality model in 16 GB" plausible at all.
Speculative decoding's draft model is usually a distilled sibling. When you pass --draft-model to Rapid-MLX, the right choice is almost always a small distilled variant of the same family (e.g. Qwen3.5-1.5B drafting for Qwen3.5-9B). Distillation gives the draft and target models similar token preferences, which is exactly what raises the acceptance rate that makes speculative decoding pay off.
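
Both connections reduce to short arithmetic. The sketch below estimates weight storage at different bit widths and the expected number of tokens emitted per target-model verification pass. The second function assumes each drafted token is accepted independently with the same probability, a simplification from the speculative-decoding literature; it is not a description of how Rapid-MLX measures or reports acceptance, and the function names are illustrative.

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring KV cache and quantization scales."""
    return n_params * bits_per_weight / 8 / 2**30

def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per verification pass of the target model.

    Assumes every drafted token is accepted independently with probability
    `acceptance_rate`; the pass always yields at least one token.
    """
    a, k = acceptance_rate, draft_len
    if a >= 1.0:
        return k + 1.0
    return (1.0 - a ** (k + 1)) / (1.0 - a)

print(round(weight_gib(7e9, 16), 1))   # 7B parameters at 16 bits per weight
print(round(weight_gib(7e9, 4), 1))    # the same model at 4 bits per weight
for rate in (0.5, 0.8, 0.9):
    print(rate, round(expected_tokens_per_step(rate, draft_len=4), 2))
```

Under these assumptions a 7B model drops from roughly 13 GiB to roughly 3.3 GiB of weights, and raising the acceptance rate from 0.5 to 0.9 with a four-token draft roughly doubles the tokens gained per verification pass. That is why a well-aligned draft model matters more than a merely small one.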

Limits to internalise: Distillation is not magic. A student can imitate behaviours its teacher demonstrates, but it doesn't acquire knowledge the teacher didn't surface in the training data. Long-tail factuality, rare languages, and unusual reasoning patterns are the predictable weak spots of distilled small models, which are exactly the cases where a cloud router (--cloud-model) earns its keep.

The contested side — frontier-lab "weight theft" claims#

A separate but adjacent topic. Several frontier labs (OpenAI, Anthropic, Microsoft) have publicly alleged that open-model labs — DeepSeek, Moonshot, MiniMax among the named — trained their models by distilling from frontier APIs in violation of those APIs' terms of service. This section walks through what's being alleged, how it would technically be accomplished, and what the public counter-arguments are. Everything below is presented as claims and disputes, not as established fact — the accused parties contest the accusations, and as a study reference this page deliberately stays balanced.

Framing this honestly: "Theft" is the framing used by the accusing labs. The accused dispute it. The underlying technique, black-box distillation from API outputs, is exactly the same as the openly-acknowledged Alpaca / Vicuna lineage and is technically indistinguishable from the legitimate distillation in the previous section. What separates "research milestone" from "alleged theft" is whether the API's terms of service permitted training a competing model on its outputs, plus questions of scale and access method. Read this section as a map of the public dispute, not a verdict.

Vocabulary check

"Weight theft" is a misnomer. No-one alleges that DeepSeek extracted the literal floating-point parameters of GPT-4 or Claude. Frontier weights have never been exposed; they couldn't be copied. What's alleged is behavioural theft: capturing the model's outputs at scale and training a student on them, so that the student inherits the teacher's behaviour without inheriting its weights.
Distillation vs. model extraction. Academic "model extraction attacks" try to recover weights or a near-functional clone of a classifier from queries. LLM distillation is different — the goal isn't weight recovery, it's capability transfer.
ToS violation ≠ legal violation. Whether breaching an API's terms of service rises to misappropriation, copyright infringement, or trade-secret theft is jurisdictionally unsettled and actively litigated.

How it would technically be accomplished

The same recipe as the previous section, applied without the teacher's permission. Stripped to its core:

python · pseudocode
# 1) Acquire API access at scale. Often via proxy networks, reseller
#    accounts, or third-party routers (OpenAI's memo to Congress alleges
#    DeepSeek used "obfuscated routers" to circumvent access controls).
clients = pool_of_api_keys(via="intermediaries")

# 2) Generate diverse, capability-targeted prompts.
#    Often a smaller open model produces the prompts to multiply scale.
prompts = synth_prompts(seed=human_curated, expand_with="open-7b-model")

# 3) Query the frontier API at scale; capture outputs.
#    Reasoning models (o1, R1) expose chain-of-thought in some surfaces;
#    capturing that CoT is what makes the distilled student strong.
samples = []
for p in prompts:
    r = clients.chat.completions.create(model="frontier", messages=p, ...)
    samples.append({"prompt": p,
                    "reasoning": r.message.reasoning_content,
                    "answer":    r.message.content})

# 4) Verify / reject-sample. Math checked symbolically, code by execution,
#    function-calls by schema validation.
samples = [s for s in samples if verify(s)]

# 5) SFT a smaller OPEN base model on the harvested data.
student = load("llama-3.1-8b-base")         # or qwen, etc.
sft_train(student, samples)

# 6) Release the student weights as "open source." Without disclosure of
#    where the training data came from, an audit can only infer it from
#    behavioural tells.

The thing to internalise: steps 1, 3, and 6 are the only steps that distinguish this from a legitimate research recipe. Steps 2, 4, and 5 are identical to how the openly-distributed DeepSeek-R1-Distill family was made (with R1 as the consenting teacher). The whole legal/ethical dispute is compressed into "who gave permission for step 3, and was the access in step 1 obtained honestly?"

Public examples for study

Stanford Alpaca (March 2023) — openly acknowledged

The seminal worked example. Stanford fine-tuned LLaMA-7B on 52,000 instruction-following examples generated by OpenAI's text-davinci-003, using the Self-Instruct prompt-expansion method. Total cost reportedly under $600. Stanford was transparent about the methodology and explicitly noted that the resulting model couldn't be released for commercial use, in part because of OpenAI's terms. In Stanford's own preliminary evaluation, the model behaved qualitatively similarly to text-davinci-003. This is the canonical "API-distilled small model" recipe; everything since is a variation.

Vicuna (UC Berkeley / CMU / Stanford / UCSD, 2023) — openly acknowledged

LLaMA fine-tuned on ~70,000 user-shared ChatGPT conversations scraped from ShareGPT. Same general pattern as Alpaca, more data, more conversational. Again, methodology was published openly; the release skirted ToS by framing the work as research, not commercial deployment.

The Berkeley "False Promise of Imitating Proprietary LLMs" paper (2023) — the skeptical counterweight

A widely-cited UC Berkeley paper that trained imitation models and evaluated them carefully. The headline finding: imitation models match the style of the teacher (tone, formatting, refusal patterns) far more easily than they match the capability. On hard benchmarks, the gap stays large. This is the empirical reason to be skeptical of the strongest version of the "DeepSeek just copied OpenAI" framing — if pure imitation hit a capability ceiling in 2023, the explanation for R1's actual benchmark performance has to involve more than copying.

OpenAI / Microsoft → DeepSeek (January 2025 onward) — contested

Shortly after the DeepSeek-R1 launch in January 2025, OpenAI and Microsoft publicly alleged that R1 had been trained in part on ChatGPT/o1 outputs obtained via distillation. Microsoft's security team reportedly observed unusual bulk-extraction patterns on OpenAI infrastructure tied to accounts associated with DeepSeek. In February 2026 OpenAI escalated by submitting a memo to the U.S. Congress China Select Committee alleging continued violations, including the use of "obfuscated routers" to bypass access controls. The cited evidence: stylistic resemblance between R1's reasoning traces and o1, performance trajectories the memo describes as inconsistent with pure-from-scratch training, and the API access logs above. DeepSeek has not publicly conceded the allegations; the case rests on circumstantial evidence and ToS arguments rather than seized training data.

Anthropic → DeepSeek, Moonshot, MiniMax (February 2026) — contested

Per CNBC reporting, Anthropic has alleged similar distillation activity by DeepSeek, Moonshot, and MiniMax against Claude. The accused labs dispute these characterisations. As with the OpenAI case, the public record consists of claims rather than disclosed forensic detail.

"Model claims to be ChatGPT" — the most informal evidence

The lightest-touch behavioural fingerprint: several open and quasi-open models from 2023–2025 would, when asked, identify themselves as ChatGPT, GPT-4, or similar — a strong indicator that ChatGPT-style outputs appeared in their fine-tuning data with the assistant identifying itself by name. It's not proof of large-scale unauthorised distillation by itself, but it's the kind of artefact that gets pointed at in the discourse.

Detection — how labs argue they can tell

Method	What it can show	What it can't
API access auditing	Patterns of bulk querying, suspicious account chaining, IP forensics. Microsoft's case against DeepSeek-affiliated accounts is reportedly built here.	The data left over after the queries — the trained student — can't be tied back to specific calls.
Output watermarking	Embed a statistical signal in the teacher's token probabilities (e.g. sinusoidal perturbations detectable by Fourier transform of the suspect model's outputs). Distillation-Resistant Watermarking (DRW, EMNLP Findings 2022) claims 100% detection in lab settings.	Watermarks are removable by paraphrasing, can be spoofed, and degrade output quality if too strong. Frontier labs have not publicly confirmed deploying them at scale.
Behavioural fingerprints	Identity slips ("I'm ChatGPT"), refusal phrasing matching the teacher's style, specific quirks transferred wholesale.	Easily fixed in subsequent fine-tunes. Suggestive, not dispositive.
Stylometric / linguistic analysis	Reasoning trace structure, idiomatic phrasing, error patterns that match the teacher more than the public web.	Models trained on similar web data sound similar by default; baseline confounds the signal.
Output suppression	Return only top-k tokens or hard labels; reasoning tokens hidden by default (OpenAI's o1 hides CoT for exactly this reason). Forces an extractor to do many more queries.	Doesn't prevent distillation, just raises the cost.
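
The behavioural-fingerprint row is the easiest to illustrate. The sketch below measures how often a suspect model's responses name another vendor's assistant. The marker list and the sample responses are hypothetical, and, as the table notes, a non-zero rate is suggestive rather than dispositive.

```python
import re

# Hypothetical marker list: phrases by which a model names another vendor's assistant.
IDENTITY_MARKERS = re.compile(
    r"\bI(?:'m| am) (?:ChatGPT|GPT-4|Claude)\b"
    r"|\b(?:trained|developed|created) by (?:OpenAI|Anthropic)\b",
    re.IGNORECASE,
)

def identity_slip_rate(responses):
    """Fraction of responses in which the model names a different vendor's assistant."""
    if not responses:
        return 0.0
    hits = sum(1 for text in responses if IDENTITY_MARKERS.search(text))
    return hits / len(responses)

responses = [
    "I am ChatGPT, a large language model trained by OpenAI.",
    "I'm an open-weight assistant. How can I help?",
    "I was developed by OpenAI to be helpful.",
    "Happy to help with that.",
]
print(identity_slip_rate(responses))  # 2 of 4 responses match -> 0.5
```

Because web text written after 2023 is full of ChatGPT transcripts, a baseline rate from models with known provenance is needed before any number from a check like this means much.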

The counter-arguments to internalise

Technique vs. consent. The same recipe is "legitimate distillation" when the teacher consents (DeepSeek-R1 → R1-Distill-Qwen) and "alleged theft" when it doesn't. The dispute is policy, not technology.
Frontier labs trained on copyrighted web data. A common rebuttal: every frontier model was trained on data scraped from publishers, authors, and code repositories without case-by-case consent. The same labs that argue API outputs are protected against training-data use have themselves taken expansive positions on training-data rights. Whether this is whataboutism or a substantive parallel depends on one's prior on IP norms.
Imitation has a ceiling. The Berkeley paper and several follow-ups suggest that pure black-box distillation copies surface style but falls behind on hard capability. If DeepSeek-R1 actually performs at the level reported, that performance probably isn't entirely attributable to copying, even granting the strongest version of the accusation.
Detection evidence is mostly circumstantial. No public allegation against an open lab has, to date, presented disclosed forensic artefacts — no watermark match, no exfiltrated training-data file. The cases rest on access patterns, behaviour, and capability curves, each of which has innocent alternative explanations.
The asymmetry is real, even granting the rebuttals. A lab that builds a frontier model spends billions on compute, RLHF, and red-teaming. A lab that distills its outputs for a few million spends much less and produces a near-substitute. Whatever one thinks of the IP framing, the economics of that gap are why frontier labs view it as existential, and why the accusations keep getting made.

What this section is for in a study wiki: Understanding the technique. You should leave this page knowing exactly how a model would be distilled from a frontier API, why detection is hard, and why the dispute is fundamentally about consent and ToS rather than capability or weights. None of that is an endorsement of any specific allegation in either direction; for the actual facts of specific incidents, follow the linked primary sources and form your own view.

Code-level appendix#

A · The Hermes recursive tool-call loop (annotated)

From NousResearch/Hermes-Function-Calling/functioncall.py. The full loop is <200 lines. Below: the canonical structure.

python · functioncall.py (excerpted)
class ModelInference:
    def generate_function_call(self, query, chat_template, num_fewshot, max_depth=5):
        depth = 0
        chat  = [{"role": "user", "content": query + " (first turn; no <tool_results> yet)"}]
        tools  = functions.get_openai_tools()
        prompt = self.prompter.generate_prompt(chat, tools, num_fewshot)
        completion = self.run_inference(prompt)

        def recursive_loop(prompt, completion, depth):
            tool_calls, assistant_msg, err = self.process_completion_and_validate(
                completion, chat_template)
            prompt.append({"role": "assistant", "content": assistant_msg})

            if tool_calls:
                tool_message = f"Agent iteration {depth}..."
                for call in tool_calls:
                    valid, why = validate_function_call_schema(call, tools)
                    if valid:
                        try:
                            resp = self.execute_function_call(call)
                            tool_message += f"<tool_response>\n{resp}\n</tool_response>\n"
                        except Exception as e:
                            tool_message += format_error_for_model(call, e)
                    else:
                        tool_message += format_schema_error(call, why)
                prompt.append({"role": "tool", "content": tool_message})
                depth += 1
                if depth >= max_depth: return
                completion = self.run_inference(prompt)
                recursive_loop(prompt, completion, depth)
            elif err:
                # model produced a malformed tool call; feed the parse error back
                prompt.append({"role": "tool", "content": format_parser_error(err)})
                ...
            else:
                return      # pure content → done

        recursive_loop(prompt, completion, depth)

Three things to notice: the loop feeds parse errors back to the model as tool responses (self-correction), it has a hard max_depth (5) to prevent runaway, and every tool result is wrapped in <tool_response> tags so the next inference sees a clean role-tagged context.
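
Those three properties can be exercised without a real model. The sketch below is a simplified, iterative stand-in for the recursive loop, not the Hermes implementation: `scripted_model`, the `TOOLS` registry, and the error wording are all hypothetical. The scripted model emits a malformed call, then a valid one, then a plain answer.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
TOOLS = {"add": lambda a, b: a + b}   # hypothetical tool registry

def run_agent(model, query, max_depth=5):
    """Call `model(messages)` until it answers without a tool call or depth runs out."""
    messages = [{"role": "user", "content": query}]
    for _ in range(max_depth):
        completion = model(messages)
        messages.append({"role": "assistant", "content": completion})
        raw_calls = TOOL_CALL_RE.findall(completion)
        if not raw_calls:
            return messages              # pure content: done
        tool_message = ""
        for raw in raw_calls:
            try:
                call = json.loads(raw)
                result = TOOLS[call["name"]](**call["arguments"])
                tool_message += f"<tool_response>\n{json.dumps(result)}\n</tool_response>\n"
            except Exception as exc:     # parse and execution errors go back to the model
                tool_message += f"<tool_response>\nerror: {exc!r}\n</tool_response>\n"
        messages.append({"role": "tool", "content": tool_message})
    return messages                      # hit max_depth: stop without a final answer

def scripted_model(messages):
    """Stand-in model: malformed call first, then a valid call, then a plain answer."""
    turn = sum(1 for m in messages if m["role"] == "assistant")
    script = [
        '<tool_call>{"name": "add", "arguments": {"a": 2</tool_call>',
        '<tool_call>{"name": "add", "arguments": {"a": 2, "b": 3}}</tool_call>',
        "The sum is 5.",
    ]
    return script[turn]

history = run_agent(scripted_model, "What is 2 + 3?")
print([m["role"] for m in history])
print(history[-1]["content"])
```

Setting `max_depth=1` cuts the run off after the first error response, which is the runaway guard in action.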

B · mlx-lm streaming generate (shape)

python · mlx-lm API surface
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("mlx-community/Qwen3.5-9B-4bit")

sampler    = make_sampler(temp=0.7, top_p=0.9)
processors = make_logits_processors(repetition_penalty=1.05)

def stream_events(messages, tools_schema, parser):
    # `parser` is the harness's streaming Hermes parser; it is not part of mlx-lm.
    prompt = tokenizer.apply_chat_template(
        messages,                           # e.g. [{"role": "user", "content": "…"}]
        tools=tools_schema,
        add_generation_prompt=True,
    )
    for response in stream_generate(
        model, tokenizer, prompt,
        max_tokens=2048,
        sampler=sampler,
        logits_processors=processors,
    ):
        parser.feed(response.text)          # streaming Hermes parser
        yield parser.drain_events()         # content + tool-call deltas
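
The `parser.feed` / `parser.drain_events` pair above is the interesting part: tags can arrive split across chunks, so the parser has to hold back text that might be the start of `<tool_call>`. The class below is a minimal sketch of that idea with hypothetical names and event shapes; it is not Rapid-MLX's parser.

```python
import json

OPEN, CLOSE = "<tool_call>", "</tool_call>"

class StreamingToolCallParser:
    """Turn streamed text chunks into ("content", str) and ("tool_call", dict) events."""

    def __init__(self):
        self._buf = ""
        self._in_call = False
        self._events = []

    def feed(self, text):
        self._buf += text
        while True:
            if self._in_call:
                end = self._buf.find(CLOSE)
                if end == -1:
                    return                       # wait for the closing tag
                payload = self._buf[:end].strip()
                self._buf = self._buf[end + len(CLOSE):]
                self._in_call = False
                try:
                    self._events.append(("tool_call", json.loads(payload)))
                except json.JSONDecodeError as exc:
                    self._events.append(("error", str(exc)))
            else:
                start = self._buf.find(OPEN)
                if start != -1:
                    if start > 0:
                        self._events.append(("content", self._buf[:start]))
                    self._buf = self._buf[start + len(OPEN):]
                    self._in_call = True
                    continue
                # Hold back any suffix that could be the beginning of OPEN.
                keep = 0
                for n in range(min(len(OPEN) - 1, len(self._buf)), 0, -1):
                    if OPEN.startswith(self._buf[-n:]):
                        keep = n
                        break
                cut = len(self._buf) - keep
                if cut > 0:
                    self._events.append(("content", self._buf[:cut]))
                self._buf = self._buf[cut:]
                return

    def drain_events(self):
        events, self._events = self._events, []
        return events

parser = StreamingToolCallParser()
events = []
for chunk in ["Sure. <tool",
              '_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool',
              "_call> Done."]:
    parser.feed(chunk)
    events.extend(parser.drain_events())
print(events)
```

A production parser would also need a flush step at end of stream, so that a held-back partial tag or an unterminated call is reported rather than silently dropped.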

C · MLX lazy evaluation in three lines

python · MLX
import mlx.core as mx

a = mx.random.uniform(shape=(1024, 1024))    # nothing computed yet
b = mx.matmul(a, a) + a                       # still just a graph node
mx.eval(b)                                    # NOW the kernels run

The whole performance story of MLX hinges on this. By the time mx.eval runs, MLX has the full graph, so it computes only what the requested output depends on and skips everything else. Fusing operations into fewer kernels is a separate, opt-in step via mx.compile.

Glossary#

Term	Meaning here
Harness	The server-side machinery that wraps a model: prompt building, generation loop, parsing, HTTP. In this stack, Rapid-MLX.
Orchestrator	The client-side loop that decides when to call the model, when to execute tools, and what to do with results. Usually one of Claude Code / Aider / Cursor / your script.
Agent	An LLM + tools + an orchestrator running the loop. "Agentic" = the loop has more than one turn and at least one tool execution.
Chat template	A Jinja file shipped with the model that converts `messages` + `tools` into the exact token string the model was trained on.
Prefill	Processing the prompt tokens to build the KV cache, before the first generated token.
Decode	Generating one token at a time, autoregressively.
TTFT	Time-To-First-Token. Latency between request and first decoded token. Dominated by prefill.
KV cache	Per-layer key/value tensors saved across decode steps so each new token doesn't redo attention over the whole history.
DeltaNet	An RNN-style attention replacement used in Qwen3.5 hybrid models. Stateful; not slice-trimmable like KV.
Speculative decoding	A small "draft" model proposes tokens; the main model verifies them in parallel. 1.5–6× decode speedup when the draft is well-aligned.
MCP	Model Context Protocol — a standard for letting servers advertise tools to clients. Rapid-MLX supports it via `--mcp-config`.

Sources#

Rapid-MLX · raullenchai/Rapid-MLX (GitHub) — README, architecture diagram, parser table, flag reference, benchmark methodology.
MLX · ml-explore/mlx (GitHub) — README enumerating lazy eval, unified memory, dynamic graphs, transforms.
mlx-lm · ml-explore/mlx-lm (GitHub) — load / generate / stream_generate API, samplers & logits processors, prompt cache, KV cache.
Hermes-Function-Calling · NousResearch (GitHub) — system prompt template, tool-call XML format, recursive loop.
functioncall.py — ModelInference class, recursive_loop, execute_function_call, max_depth.
hermes-function-calling-v1 dataset — training data shape that taught the model the format.
waybarrios/vllm-mlx — upstream of Rapid-MLX.
MLX documentation — quick start, transforms, multi-device.
DeepSeek-R1-Distill-Qwen-1.5B (Hugging Face) — distilled student family, model card and config.
Knowledge Distillation Using Frontier Open-source LLMs (arXiv 2410.18588) — Llama-3.1-405B → 8B/70B with synthetic data; the recent reference for black-box distillation.
Hermes 3 — Nous Research — Hermes-3 announcement and training details on Llama 3.1 base models.
Hermes-2-Pro-Llama-3-8B (Hugging Face) — model card: function-calling and JSON-mode dataset, eval scores.
The Complete Guide to DeepSeek Models (BentoML) — methodology summary: 800k samples, 6 students, SFT-only recipe.
Stanford Alpaca — the canonical openly-acknowledged API-distilled small model.
The False Promise of Imitating Proprietary LLMs (UC Berkeley, arXiv 2305.15717) — the empirical case that imitation copies style more readily than capability.
OpenAI vs DeepSeek distillation dispute (Rest of World) — overview of the public allegations and counter-claims.
OpenAI memo to Congress (FDD analysis) — coverage of the February 2026 memo, including the "obfuscated routers" allegation.
Anthropic accuses DeepSeek, Moonshot, MiniMax (CNBC) — public reporting on additional accusations against open labs.
Distillation-Resistant Watermarking (EMNLP Findings 2022) — the technical method behind output watermarking and Fourier-based detection.

Built as a study reference. Nothing in this page is private to any company; every claim should trace back to a public source above. If something here disagrees with what you observe in code, the code is right — file an issue against your own notes.