Rapid-MLX, Hermes & MLX — Architecture Study Wikireference
A reference-grade walk-through of how an OpenAI-compatible local LLM server is actually built on Apple Silicon: from MLX kernels at the bottom, through mlx-lm's generation loop, into Rapid-MLX's serving harness, up to the Hermes-style tool-call protocol that lets a model say "please run this function for me."
Overview & scope#
This wiki is a study companion for understanding how a modern, local, tool-capable LLM server fits together on Apple Silicon. It treats five components (MLX, mlx-lm, Rapid-MLX, the agent / orchestrator loop, and the Hermes tool-call format) as a single layered system, and walks each layer in enough detail that you could plausibly debug or extend any of them.
Who this is for
- Engineers who can read Python and have a rough idea what a transformer does, but haven't traced an inference request end-to-end through a real serving stack.
- Anyone trying to reason about why "tool calling" works at all — what the model actually sees, what the harness actually does, where the contract lives.
- People deciding whether to use, fork, or replace any of these layers and want a map first.
In scope
| Topic | Depth here |
|---|---|
| MLX as an array framework (lazy eval, unified memory, transforms) | Conceptual — enough to know why it's fast, not how to write Metal kernels. |
| mlx-lm's generation loop, samplers, logits processors, KV cache | API surface + the autoregressive loop in pseudocode. |
| Rapid-MLX's serving architecture, prompt cache, parsers, cloud routing | From the README architecture diagram down to flag-level behaviour. |
| The agent loop (server-side vs client-side responsibilities) | Canonical loop in pseudocode + annotated Hermes recursive_loop. |
| Hermes tool-call protocol (system prompt, <tool_call>, <tool_response>) | Wire format with examples, plus how it's rendered through the chat template. |
| How "the LLM knows what tools exist" | End-to-end render of a tool schema into the exact tokens the model sees. |
Out of scope (deliberately)
- Training, fine-tuning, LoRA mechanics — mentioned in passing, not explained.
- Metal shader internals, MLX kernel authoring.
- Benchmarking methodology — numbers are quoted from the Rapid-MLX README without re-running them.
- Alternative tool-call formats (Llama, DeepSeek, Harmony, etc.) beyond a table; Hermes is the worked example.
- Production concerns (auth, multi-tenancy, observability) — Rapid-MLX has flags for these; this page doesn't dwell on them.
How to read
Top-to-bottom is the intended path: the mental model and five-layer diagram are the spine, and every later section refers back to them. If you only have ten minutes, read Mental model + End-to-end trace — together they're a complete-enough picture to navigate the rest later. If you're here to verify a specific claim, jump straight to the appendix for source excerpts and Sources for the underlying repos.
The mental model#
Five components, each a thin abstraction over the one below it. The trick to understanding the whole stack is to see that each layer only knows about the one directly under it, and the protocol between an LLM and a tool is just a string contract enforced by careful prompt formatting and careful parsing.
Five layers of the stack#
1 · MLX
Arrays, autograd, Metal kernels. Apple's NumPy/JAX for unified memory.
2 · mlx-lm
LLM weights → tokens. Generation loop, KV cache, samplers, chat templates.
3 · Rapid-MLX
OpenAI-compatible HTTP server. Prompt cache, tool parsers, cloud routing.
4 · Agent loop
The recursion: model → tool call → execute → result → model → …
5 · Hermes format
The XML/JSON contract for tool advertisement and tool calls.
1 · Apple MLX — the foundation#
MLX is an array framework for Apple silicon, built by Apple Machine Learning Research. Think of it as NumPy + autograd + Metal, but designed from day one for the Apple unified-memory architecture instead of being a CUDA framework retrofitted onto a Mac.
The six properties that matter
| Property | What it means | Why an inference engine cares |
|---|---|---|
| Familiar APIs | Python API mirrors NumPy; mlx.nn mirrors PyTorch. | Almost zero porting cost from a PyTorch reference implementation. |
| Lazy computation | Operations build a graph; results materialise only when mx.eval() runs (or a value is read). | Lets MLX fuse kernels, eliminate intermediate allocations, and reorder ops. |
| Dynamic graphs | Graphs are constructed every call; shape changes don't recompile. | Variable-length sequences (the norm in LLM decoding) cost nothing extra. |
| Multi-device | Same array can run on CPU or GPU; no .to(device). | Preprocessing on CPU and attention on GPU share the same buffer. |
| Unified memory | Arrays live in a single shared address space. | No host↔device copy of the KV cache, ever — this is the largest single win for decode-side perf. |
| Composable transforms | grad, vmap, and compile compose, much like JAX's grad / vmap / jit. | Same primitive supports training, fine-tuning (LoRA), and inference paths. |
Why unified memory is the whole game
On a discrete-GPU system, model weights live in GPU VRAM while host RAM holds the request queue, tokenizer state, and KV cache scaffolding, so data has to cross PCIe whenever the two sides exchange it. Apple Silicon has one physical memory pool addressable by both CPU and GPU, and MLX exposes that as the architectural primitive: there is no "move to GPU" call. The KV cache, which grows linearly with context length, never has to be transferred between devices. This is the structural argument for why an MLX-native engine can outperform a more generic engine such as llama.cpp's Metal backend on many models, even when the latter is highly tuned.
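To make "grows linearly with context length" concrete, here is a minimal sizing sketch. The formula is the standard one for a transformer KV cache; the layer and head dimensions are illustrative assumptions, not values read from any particular checkpoint.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to hold keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values, in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Illustrative dimensions only (assumed, fp16 elements).
per_token = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=1)
at_32k = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)

print(per_token // 1024, "KiB per token")
print(at_32k / 2**30, "GiB at 32k tokens")
```

With these assumed dimensions the cache costs 128 KiB per token and 4 GiB at a 32k context, which is the scale of buffer that unified memory avoids copying.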
MLX is the layer underneath mlx-lm, mlx-vlm, and engines like Rapid-MLX.
2 · mlx-lm — the model runtime#
mlx-lm is a Python package that turns "a folder of weights from Hugging Face" into "a Python function that produces tokens." It is the layer that owns: the tokenizer, the chat template, the autoregressive loop, the KV cache, and the sampler.
Three functions are the whole API
python

    from mlx_lm import load, generate, stream_generate

    model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
    messages = [{"role": "user", "content": "Write a story about Einstein"}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

    # one-shot
    text = generate(model, tokenizer, prompt=prompt, verbose=True)

    # streaming
    for response in stream_generate(model, tokenizer, prompt, max_tokens=512):
        print(response.text, end="", flush=True)
The autoregressive loop, in spirit
Every LLM inference engine is some variation of this loop. mlx-lm's implementation is a clean reference version of it.
python · pseudocode

    def generate(model, tokenizer, prompt, max_tokens, sampler, logits_processors):
        tokens = tokenizer.encode(prompt)
        kv_cache = make_cache(model)                      # layer-wise KV buffers

        # PREFILL: process all but the last prompt token in chunks of
        # --prefill-step-size; the last token is fed by the first decode step,
        # so no position is pushed through the model twice.
        for chunk in chunked(tokens[:-1], prefill_step_size):
            model(chunk, cache=kv_cache)                  # only the cache update matters

        # DECODE: one token at a time, until EOS or limit
        for _ in range(max_tokens):
            logits = model(tokens[-1:], cache=kv_cache)   # shape: [1, 1, vocab]
            for proc in logits_processors:
                logits = proc(tokens, logits)             # e.g. tool logits bias
            next_token = sampler(logits)                  # temp / top-p / argmax
            if next_token == tokenizer.eos_token_id:
                break
            yield next_token
            tokens.append(next_token)
The pieces, named
- tokenizer.apply_chat_template(messages, …): turns a list of role/content dicts (and optionally a tools=… argument) into the single prompt the model was trained to recognise. The template is a Jinja file shipped with the model; add_generation_prompt=True appends the "assistant:" preamble.
- Sampler: any callable (logits) → token. Temperature, top-p, top-k, min-p, argmax. mlx_lm.sample_utils ships standard ones.
- Logits processors: an ordered list of (history, logits) → logits. Repetition penalty lives here, and so does Rapid-MLX's tool logits bias, which nudges the model toward structured tokens like <tool_call> at the moment it should be opening one.
- KV cache: per-layer key/value tensors that grow with sequence length. mlx-lm supports a rotating fixed-size cache (--max-kv-size) for long generations.
- Prompt cache: serialise the KV state to disk (mlx_lm.cache_prompt) so a long system prompt only gets prefilled once. This is the seed Rapid-MLX builds its in-memory prompt cache on.
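The two callable shapes named above, (logits) → token and (history, logits) → logits, can be sketched in plain Python. This is a toy on Python lists, not mlx-lm's implementation; the names `repetition_penalty` and `make_sampler` are chosen here for illustration.

```python
import math
import random


def repetition_penalty(penalty):
    """Build a logits processor: (history, logits) -> logits."""
    def proc(history, logits):
        out = list(logits)
        for t in set(history):
            # Divide positive logits and multiply negative ones, so an
            # already-seen token always becomes less likely.
            out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
        return out
    return proc


def make_sampler(temperature=1.0, top_p=1.0, rng=random):
    """Build a sampler: (logits) -> token id. temperature == 0 means argmax."""
    def sample(logits):
        if temperature == 0:
            return max(range(len(logits)), key=logits.__getitem__)
        scaled = [x / temperature for x in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
        # Nucleus: keep the smallest top-ranked set whose mass reaches top_p.
        kept, mass = [], 0.0
        for p, i in ranked:
            kept.append((p, i))
            mass += p
            if mass >= top_p:
                break
        r = rng.random() * mass
        for p, i in kept:
            r -= p
            if r <= 0:
                return i
        return kept[-1][1]
    return sample


logits = [2.0, 1.0, 0.5]
greedy = make_sampler(temperature=0)
penalised = repetition_penalty(4.0)([0], logits)   # token 0 was already emitted
print(greedy(logits), greedy(penalised))
```

Because processors run before the sampler, penalising token 0 changes the greedy pick from token 0 to token 1 without the sampler knowing anything about history.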
3 · Rapid-MLX — the serving harness#
Rapid-MLX (raullenchai/Rapid-MLX) is a fork of waybarrios/vllm-mlx that wraps mlx-lm and mlx-vlm in an OpenAI-compatible HTTP server, then aggressively layers performance and reliability tricks on top. The package directory is vllm_mlx/.
The architecture, from the README
What Rapid-MLX owns that mlx-lm doesn't
| Concern | Implementation |
|---|---|
| HTTP surface | FastAPI, OpenAI /v1/chat/completions, /v1/models, streaming SSE. |
| Persistent state | In-memory prompt cache keyed by message prefix; restored across requests. |
| Tool-call extraction | 17 parsers, one per model family (hermes, llama, deepseek, harmony, kimi, glm47, minimax, …). Auto-selected from model name. |
| Reasoning extraction | Separate parsers for <think>-style chain-of-thought, surfaced as reasoning_content (never mixed into content). |
| Recovery | If a 4-bit quantized model emits a malformed tool call as plain text, the parser auto-converts it back to structured tool_calls JSON. |
| Routing | If new_tokens > --cloud-threshold, the request is shipped to a cloud LLM via litellm instead of running locally. |
| Streaming hygiene | Think-tag filter, chunk-boundary leak fix, developer role normalisation, disconnect guard. |
SimpleEngine: the heart of the server
SimpleEngine is the boundary class. It accepts an OpenAI chat-completion request, decides whether to use the cache, runs the mlx-lm generation loop with the right logits processors and sampler, and emits a stream of tokens that the parser layer turns back into a structured response. Everything else — vision, audio, embeddings — sits beside SimpleEngine as a sibling and is dispatched by route.
Auto-selection examples: Qwen3.5-* → hermes + qwen3 reasoning. DeepSeek-R1 → deepseek + deepseek_r1 reasoning. GPT-OSS → harmony. An explicit --tool-call-parser always overrides. Hermes is the most widely compatible format, so Mistral, Devstral, Gemma, and Phi-3/4 all use it.
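Name-based parser selection of this kind can be sketched as an ordered substring table. The parser names come from this page; the matching logic, the `select_parser` function, the rule order, and the fallback to hermes are assumptions for illustration, not Rapid-MLX's code.

```python
# Ordered (substring, tool parser) rules; first match wins.
RULES = [
    ("gpt-oss", "harmony"),
    ("deepseek", "deepseek"),
    ("minimax", "minimax"),
    ("glm", "glm47"),
    ("kimi", "kimi"),
    ("hermes", "hermes"),   # checked before "llama" so Hermes-3-Llama maps to hermes
    ("llama", "llama"),
]
DEFAULT_PARSER = "hermes"   # assumed fallback for Qwen, Mistral, Gemma, Phi, ...


def select_parser(model_name, override=None):
    """Pick a tool-call parser; an explicit override always wins."""
    if override:
        return override
    name = model_name.lower()
    for needle, parser in RULES:
        if needle in name:
            return parser
    return DEFAULT_PARSER


print(select_parser("Qwen3.5-9B-4bit"))
print(select_parser("gpt-oss-20b"))
print(select_parser("Qwen3.5-9B-4bit", override="llama"))
```

Ordering the rules matters as soon as one model name contains two family names, which is why the table is a list rather than a dict lookup.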
Server flags that change behaviour at runtime
| Flag | What it does | When you turn it on |
|---|---|---|
| --enable-tool-logits-bias | Logits processor that biases toward structured tokens (e.g. <tool_call> opener) once a tool call is detected starting. | Speed + reliability of tool-emitting models. |
| --prefill-step-size | Tokens processed per prefill chunk (default 2048). | Larger = faster cold start, more peak memory. |
| --kv-bits 4\|8 | Quantize the KV cache. | Long contexts on small memory budgets. |
| --draft-model | Speculative decoding draft model. | 2× decode boost on compatible model pairs. |
| --cloud-model + --cloud-threshold | Spill long-context requests to a cloud LLM. | You want fast latency on small chats and large-context fall-through. |
| --mcp-config | Wire in an external Model Context Protocol tool catalog. | Letting the server itself surface tools to clients. |
4 · The agent loop (orchestrator)#
"Agent" and "orchestrator" are overloaded words. In this stack they have two distinct meanings depending on which side of the API you stand on. Untangling them is half the battle.
| | Server-side loop (Rapid-MLX) | Client-side loop (Claude Code, Aider, …) |
|---|---|---|
| Owns | Token sampling, parser, streaming, prompt cache. | Tool schemas, tool execution, multi-turn planning, user UI. |
| Inputs | OpenAI chat request (messages + tools). | User prompt + filesystem + git + shell. |
| Outputs | Structured tool_calls or final content. | Edits, diffs, runs, follow-up messages. |
| Loop trigger | Each HTTP call is one model turn. | If response contains tool_calls → execute → re-call server. |
The Rapid-MLX server is logically stateless per turn. It receives the whole transcript every time, runs the model once, returns either content or tool_calls, and keeps no conversation state (the prompt cache is a performance optimisation, not part of the contract). The orchestrator is whichever client is driving: Claude Code, Cursor, Aider, OpenCode, or your own script. This is why "drop-in OpenAI replacement" works: the client already knows how to run the agent loop against any OpenAI-compatible endpoint.
The canonical agent loop, on either side
python · pseudocode

    import json

    def agent_loop(user_query, tools, max_depth=5):
        messages = [
            {"role": "system", "content": system_prompt_with_tools(tools)},
            {"role": "user", "content": user_query},
        ]
        for step in range(max_depth):
            resp = openai_chat_completion(messages=messages, tools=tools)
            msg = resp.choices[0].message
            messages.append(msg)                          # assistant turn
            if not msg.tool_calls:
                return msg.content                        # done
            for call in msg.tool_calls:
                # OpenAI-style calls nest name/arguments under .function,
                # and arguments arrive as a JSON string.
                args = json.loads(call.function.arguments)
                result = dispatch(call.function.name, args)   # execute
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": result,
                })
        raise MaxDepthExceeded()
In the Hermes reference implementation, this loop is called recursive_loop and lives inside generate_function_call. Each level either (a) finds tool calls, executes them, appends a <tool_response> turn, and recurses, or (b) decides the model is done. Max depth defaults to 5. The annotated source is in the appendix.
5 · The Hermes tool-call format#
"Hermes format" is the protocol developed by Nous Research for their Hermes-2-Pro / Hermes-3 models. It is a conventional protocol — there's no magic, just a system prompt and two XML tags that the model is trained to respect. Rapid-MLX uses this format as its default for Qwen, Mistral, Devstral, Gemma, and Phi-3/4 because they all tolerate it well.
The three pieces of the contract
1. The system prompt
Tells the model that it is a function-calling agent and lists every available tool. Tools are serialised as JSON Schema-flavoured signatures inside a <tools> tag.
text · system message

    You are a function calling AI model. You are provided with function
    signatures within <tools></tools> XML tags. You may call one or more
    functions to assist with the user query. Don't make assumptions about
    what values to plug into functions.
    <tools>
    {"type": "function", "function": {
      "name": "get_stock_price",
      "description": "Get the current stock price for a ticker symbol",
      "parameters": {
        "type": "object",
        "properties": {"symbol": {"type": "string"}},
        "required": ["symbol"]
      }
    }}
    </tools>
    For each function call return a json object with function name and
    arguments within <tool_call></tool_call> XML tags as follows:
    <tool_call>{"name": "<function-name>", "arguments": <args-dict>}</tool_call>

2. The assistant's tool call

text · assistant message

    I'll look that up for you.
    <tool_call>
    {"name": "get_stock_price", "arguments": {"symbol": "TSLA"}}
    </tool_call>

3. The tool's response, fed back as the next turn

text · tool message

    <tool_response>
    {"name": "get_stock_price", "content": {"symbol": "TSLA", "price": 312.04}}
    </tool_response>
Why XML tags around JSON?
Two reasons. First: greppability. <tool_call>…</tool_call> is trivially findable by streaming parsers even mid-token. Second: state machine clarity. The model sees a clear "I am now in tool-call mode" boundary, which empirically helps small/quantized models stay structured. JSON inside gives the args their type discipline.
Quantized models sometimes emit the tool_call JSON without the surrounding tags, or with attribute keys subtly wrong. Rapid-MLX's "auto tool recovery" pass catches these by pattern-matching the JSON-shaped chunk in the model's plain-text content and reconstructing the structured tool-call envelope before returning to the client. Per the README, this is what gets quantized Qwen3.5 to 100% tool-call success.
Multiple calls per turn, parallel tools
The model may emit several <tool_call> blocks in a single assistant turn; the orchestrator should execute all of them and return all <tool_response> blocks together in the next tool turn. Modern Hermes-trained models handle this natively.
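A minimal sketch of the orchestrator side of that rule: pull every <tool_call> block out of one assistant turn, run each call, and pack the results into a single tool turn. The regex approach, the `registry` dict, and the canned price are illustrative assumptions; a production parser streams and is more tolerant of malformed output.

```python
import json
import re

# The trailing "\s*</tool_call>" forces the lazy match to extend to the brace
# that actually closes the block, so nested JSON objects are captured whole.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(assistant_text):
    """Return every well-formed tool call found in one assistant turn."""
    return [json.loads(raw) for raw in TOOL_CALL_RE.findall(assistant_text)]


def tool_response_turn(calls, registry):
    """Execute each call and wrap all results in one tool message."""
    blocks = []
    for call in calls:
        result = registry[call["name"]](**call["arguments"])
        payload = json.dumps({"name": call["name"], "content": result})
        blocks.append(f"<tool_response>\n{payload}\n</tool_response>")
    return {"role": "tool", "content": "\n".join(blocks)}


# Hypothetical tool returning canned data.
registry = {"get_stock_price": lambda symbol: {"symbol": symbol, "price": 312.04}}

text = (
    "I'll look both up.\n"
    '<tool_call>\n{"name": "get_stock_price", "arguments": {"symbol": "TSLA"}}\n</tool_call>\n'
    '<tool_call>\n{"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}\n</tool_call>'
)
calls = extract_tool_calls(text)
turn = tool_response_turn(calls, registry)
print(turn["content"])
```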
How the LLM knows what it can call#
This is the section the title of the wiki is really about. There is no magic introspection — the model sees only the tokens you give it. So "what the harness offers" is a function of three rendering decisions:
- Schema rendering: the orchestrator (or the chat template) renders each tool's JSON Schema into the system prompt, inside <tools>…</tools>.
- Template binding: the chat template (Jinja, shipped with the model) decides exactly how the tools block is interleaved with the system instructions and user messages. Most Hermes-trained chat templates accept a tools=… kwarg to apply_chat_template.
- Training: the model has been fine-tuned on conversations that follow this exact format, so it has learned to (a) emit a <tool_call> block when calling a tool, (b) wait for a <tool_response>, (c) emit normal content when answering.
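The first two decisions, schema rendering and template binding, can be approximated with plain string formatting. This is a hand-rolled stand-in for what a model's Jinja chat template does, assuming ChatML framing and the Hermes system-prompt wording quoted earlier; `render_system_prompt` and `render_chatml` are names invented for this sketch.

```python
import json

PREAMBLE = (
    "You are a function calling AI model. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more "
    "functions to assist with the user query. Don't make assumptions about "
    "what values to plug into functions."
)
CALL_FORMAT = (
    "For each function call return a json object with function name and "
    "arguments within <tool_call></tool_call> XML tags as follows:\n"
    '<tool_call>{"name": "<function-name>", "arguments": <args-dict>}</tool_call>'
)


def render_system_prompt(tools):
    """Serialise each tool schema compactly, one per line, inside <tools>."""
    lines = [json.dumps(t, separators=(",", ":")) for t in tools]
    return "\n".join([PREAMBLE, "<tools>", *lines, "</tools>", CALL_FORMAT])


def render_chatml(system, user):
    """ChatML framing, ending with the open assistant turn the model completes."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from disk.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]
prompt = render_chatml(render_system_prompt(tools), "What's in README.md?")
print(prompt)
```

Everything the model "knows" about read_file is the one JSON line inside <tools>; remove that line and the tool does not exist as far as the model is concerned.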
Walk-through: a single tool, end-to-end
python · client

    from openai import OpenAI

    # Assumed local endpoint; point base_url at wherever your server listens.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    # 1) Client describes the tool in OpenAI form
    tools = [{
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from disk.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }]
    resp = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": "What's in README.md?"}],
        tools=tools,
    )
text · what the model actually sees (after chat template)

    <|im_start|>system
    You are a function calling AI model. ...
    <tools>
    {"type":"function","function":{"name":"read_file","description":"Read a file from disk.","parameters":{"type":"object","properties":{"path":{"type":"string"}},"required":["path"]}}}
    </tools>
    For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags ...
    <|im_end|>
    <|im_start|>user
    What's in README.md?<|im_end|>
    <|im_start|>assistant

text · what the model emits

    <tool_call>
    {"name": "read_file", "arguments": {"path": "README.md"}}
    </tool_call><|im_end|>

json · what Rapid-MLX returns to the client

    {
      "choices": [{
        "message": {
          "role": "assistant",
          "content": null,
          "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "read_file",
              "arguments": "{\"path\": \"README.md\"}"
            }
          }]
        },
        "finish_reason": "tool_calls"
      }]
    }
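The harness has three jobs: (1) Render the tool schemas into the prompt through the chat template. (2) Detect <tool_call> in the stream and switch parser state. (3) Reshape the model's output into OpenAI's tool_calls JSON before returning. Rapid-MLX does all three; the model just produces tokens.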
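The reshaping step is mostly bookkeeping, and a short sketch shows the one detail clients trip over: OpenAI-style arguments are a JSON string, not an object. The function name `to_openai_message` and the id scheme are assumptions for illustration.

```python
import json
import uuid


def to_openai_message(content_text, hermes_calls):
    """Build (assistant message, finish_reason) from already-parsed Hermes calls."""
    if not hermes_calls:
        return {"role": "assistant", "content": content_text}, "stop"
    tool_calls = [
        {
            "id": f"call_{uuid.uuid4().hex[:8]}",   # id format is an assumption
            "type": "function",
            "function": {
                "name": call["name"],
                # Re-serialise: clients expect a JSON string here.
                "arguments": json.dumps(call["arguments"]),
            },
        }
        for call in hermes_calls
    ]
    message = {
        "role": "assistant",
        "content": content_text or None,   # empty text becomes null
        "tool_calls": tool_calls,
    }
    return message, "tool_calls"


parsed = [{"name": "read_file", "arguments": {"path": "README.md"}}]
message, finish_reason = to_openai_message("", parsed)
print(json.dumps(message, indent=2), finish_reason)
```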
End-to-end trace#
One full round of "user asks → tool gets called → user gets answer," with every layer's responsibility labelled.
Parsers & recovery in detail#
Tool-call parsers are the most subtle part of the server. They run as a streaming state machine over the decoded tokens, and they're the only thing between "model emitted text" and "client receives structured JSON." There are 17 of them in Rapid-MLX, one per major model family.
| Parser | Native format | Models |
|---|---|---|
| hermes | <tool_call>{json}</tool_call> | Qwen3.5, Mistral, Devstral, Gemma, Phi-3/4, Hermes-3 |
| llama | JSON only, often {"name": ..., "parameters": ...} | Llama 3.x |
| deepseek / deepseek_v31 | Family-specific JSON wrappers | DeepSeek V2.5, V3, V3.1, R1 |
| harmony | OpenAI's open-weight Harmony channel format | GPT-OSS |
| minimax | XML-flavoured tool format | MiniMax-M2.5 |
| glm47 | GLM-family tool format | GLM-4.7 |
| kimi | Kimi-Linear tool format | Kimi-Linear |
The state machine
python · pseudocode (hermes)

    class HermesParser:
        OPEN = "<tool_call>"
        CLOSE = "</tool_call>"

        def __init__(self, tools_schema):
            self.state = "content"      # content | in_call
            self.buf = []               # text seen since the opening tag
            self.content = []
            self.calls = []
            self.schema = tools_schema

        def feed(self, token_text):
            if self.state == "content":
                if looks_like_open(token_text, self.OPEN):
                    self.state = "in_call"
                    return []
                self.content.append(token_text)
                return [stream_event("content", token_text)]
            else:                       # in_call
                self.buf.append(token_text)
                joined = "".join(self.buf)
                if self.CLOSE in joined:
                    raw = joined.split(self.CLOSE, 1)[0]   # text before the close tag
                    call = recover_json(raw)               # tolerant parse
                    if validate(call, self.schema):
                        self.calls.append(call)
                    self.state = "content"
                    self.buf = []
                return []
Auto-recovery — what "100% tool calling" actually means
The recovery pass runs after generation if the model produced something that smells like a tool call but didn't conform. Patterns it handles:
- Missing opening tag: {"name":"x","arguments":{...}} emitted as plain content → wrap with <tool_call>.
- Markdown-fenced JSON: ```json\n{...}\n``` with no XML at all → extract and structure.
- Truncated close tag: <tool_call>{json}</tool (EOS hit early) → close synthetically if JSON is valid.
- parameters vs arguments key drift → normalise to arguments for OpenAI compat.
This is why a 4-bit quantized Qwen3.5 model can hit 100% tool-call success in Rapid-MLX's evals — the model occasionally fumbles the formatting, the parser silently fixes it, and the client never sees the mess.
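One way to cover all four patterns with a single pass is to ignore the wrapper entirely: slice from the first "{" to the last "}", try to parse it, and normalise the keys. This is a minimal sketch of that idea, not Rapid-MLX's recovery code; `recover_tool_call` is a name invented here, and it handles only one call per text.

```python
import json


def recover_tool_call(text):
    """Return {"name", "arguments"} if text contains a tool-call-shaped object."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None                      # braces present, but not valid JSON
    if not isinstance(obj, dict) or "name" not in obj:
        return None
    # Key drift: accept "parameters" as a synonym for "arguments".
    args = obj.get("arguments", obj.get("parameters"))
    if not isinstance(args, dict):
        return None
    return {"name": obj["name"], "arguments": args}


fence = "`" * 3                          # built at runtime to keep this block readable
samples = [
    '{"name": "x", "arguments": {"a": 1}}',                      # missing tags
    f'{fence}json\n{{"name": "x", "arguments": {{"a": 1}}}}\n{fence}',  # fenced JSON
    '<tool_call>{"name": "x", "arguments": {"a": 1}}</tool',     # truncated close
    '{"name": "x", "parameters": {"a": 1}}',                     # key drift
]
for s in samples:
    print(recover_tool_call(s))
print(recover_tool_call("The answer is 42."))
```

The trade-off of being this tolerant is false positives: any plain-text answer that happens to contain a JSON object with a name key would be treated as a call, so a real pass would also check the name against the advertised tools.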
Performance techniques#
The README lists nine optimisation techniques. Three of them are conceptually interesting enough to study; the rest are configuration knobs.
Prompt cache · KV trim
For a standard transformer, the KV cache at position n only depends on tokens 0..n. So if turn 2 starts with the same 10,000 tokens of system+history as turn 1, you can literally reuse the KV cache from turn 1 and only prefill the new suffix. Rapid-MLX hashes the message prefix and trims its in-memory cache to the longest common prefix. README claims 2–5× faster TTFT.
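The reuse decision comes down to a longest-common-prefix computation over token ids. The sketch below compares tokens directly instead of hashing message prefixes, which is a simplification; `plan_prefill` and the toy token ids are illustrative.

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens the two sequences share."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def plan_prefill(cached_tokens, new_tokens):
    """Return (cache entries to keep, suffix that still needs prefill)."""
    keep = common_prefix_len(cached_tokens, new_tokens)
    # Always leave at least one token to feed, so the model produces logits.
    keep = max(0, min(keep, len(new_tokens) - 1))
    return keep, new_tokens[keep:]


turn1 = [1, 2, 3, 4, 5, 6]        # system prompt + first user message (toy ids)
turn2 = [1, 2, 3, 4, 9, 9, 9]     # same prefix, different continuation

keep, suffix = plan_prefill(turn1, turn2)
print(keep, suffix)
```

Here four cached positions survive and only three tokens need prefill; the cache rows past position four are the part that gets trimmed.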
DeltaNet state snapshots
Qwen3.5 uses Gated DeltaNet (an RNN-style layer) for 75% of its layers and full attention for the other 25%. RNN state isn't "trimmable" the way KV is — you can't slice off the last k rows because each step depends on all prior steps. Rapid-MLX's trick: deep-copy the RNN state at the system-prompt boundary the first time you see it, and on subsequent requests, restore the snapshot in ~0.1 ms instead of re-running hundreds of tokens through the recurrent path. README reports 1.5–4.8× TTFT speedup on Qwen3.5 variants — it's the first prompt-cache implementation for hybrid RNN architectures on MLX.
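The snapshot idea can be shown with a toy recurrent state: since the state cannot be sliced, copy it once at the system-prompt boundary and restore the copy later. `ToyRecurrentState` and the integer hash update are stand-ins invented for this sketch; a real layer would hold tensors.

```python
import copy


class ToyRecurrentState:
    """Stand-in for a recurrent layer's state: every token folds into it."""

    def __init__(self):
        self.h = 0

    def update(self, token):
        self.h = (self.h * 31 + token) % 1_000_003


snapshots = {}   # key: system-prompt tokens, value: state at that boundary


def prefill(system_tokens, user_tokens):
    """Return (final state value, number of tokens actually computed)."""
    key = tuple(system_tokens)
    computed = 0
    if key in snapshots:
        state = copy.deepcopy(snapshots[key])      # restore, skip the recurrence
    else:
        state = ToyRecurrentState()
        for t in system_tokens:
            state.update(t)
            computed += 1
        snapshots[key] = copy.deepcopy(state)      # snapshot at the boundary
    for t in user_tokens:
        state.update(t)
        computed += 1
    return state.h, computed


system = [5, 6, 7, 8]
h1, cost1 = prefill(system, [1, 2])   # cold: computes system + user tokens
h2, cost2 = prefill(system, [1, 2])   # warm: computes user tokens only
print(h1 == h2, cost1, cost2)
```

The deep copy on restore matters: handing out the stored object itself would let the next request mutate the snapshot.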
Tool logits bias (jump-forward decoding)
Once the parser detects the model is starting a <tool_call> sequence, it knows the next several tokens must be the opening JSON structure. A logits processor biases those tokens upward — or in the limit, force-decodes them — skipping samples for tokens whose value is already determined by the format. This is both a speedup and a reliability win (the structure can't go wrong).
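A logits processor for this can be very small. The sketch assumes the harness knows the token ids of the forced prefix (for example the tokens of the opening JSON structure) and passes in only the tokens generated since the opening tag; both of those, and the name `make_tool_bias`, are assumptions for illustration.

```python
def make_tool_bias(expected_ids, bias=5.0):
    """Logits processor that boosts the next token of a known forced prefix."""
    def proc(history_since_open, logits):
        pos = len(history_since_open)
        on_track = history_since_open == expected_ids[:pos]
        if pos >= len(expected_ids) or not on_track:
            return logits                    # past the prefix, or model diverged
        out = list(logits)
        out[expected_ids[pos]] += bias       # nudge toward the determined token
        return out
    return proc


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


expected = [3, 1]                            # toy ids for the forced prefix
proc = make_tool_bias(expected)
logits = [0.0, 0.5, 0.0, 0.0, 0.0]           # unbiased, the model prefers token 1

print(argmax(proc([], logits)), argmax(proc([3], logits)))
```

Raising `bias` toward infinity turns the nudge into forced decoding, at which point the sampling step for those positions can be skipped altogether.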
How open models learn — distillation#
A short detour. The rest of the wiki is about running models; this section is about how the specific models Rapid-MLX serves — DeepSeek-R1 distilled variants, Hermes-3-Llama, Qwen3.5, GPT-OSS — got to be small, fast, and good. Distillation is the single most important reason a 7B model on your laptop can hold its own against a 70B model from last year.
Three flavours, in order of "openness"
| Flavour | What the student sees from the teacher | Requires |
|---|---|---|
| White-box (logit) distillation | Full output distribution per token, often via KL-divergence loss against a temperature-softened teacher. | Teacher weights or at least logits exposed. |
| Feature distillation | Hidden-state matching: align student layer activations to teacher layer activations. | Teacher weights and architectural compatibility. |
| Black-box (response) distillation | Only the teacher's sampled outputs — text completions, sometimes with reasoning chains. | Only an API. Works against closed models. |
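For the white-box row, the per-token loss can be written out directly. This is a plain-Python sketch of KL divergence between temperature-softened teacher and student distributions, with the conventional T² scaling from the original knowledge-distillation formulation; the function names and example logits are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over one token's vocabulary distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return kl * temperature ** 2               # keeps gradient scale comparable


teacher = [4.0, 1.0, 0.5]
print(kd_loss(teacher, teacher))               # identical distributions
print(kd_loss(teacher, [0.5, 1.0, 4.0]))       # student prefers the wrong token
```

Black-box distillation has no access to `teacher_logits` at all, which is why it falls back to ordinary cross-entropy on the teacher's sampled text.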
For open models in 2026 the dominant flavour is black-box distillation on synthetic data, usually augmented with the teacher's chain-of-thought traces. It's cheap (one inference pass per training sample), it works against any teacher you can prompt, and the resulting supervision is high-signal because the student is essentially learning to imitate a strong policy.
The recipe in pseudocode
python · pseudocode

    # 1) Curate diverse prompts that cover the capability you want.
    prompts = load_prompts(domains=["math", "code", "tool-use", "chat", ...])

    # 2) Run the TEACHER over every prompt. Keep reasoning + final answer.
    teacher = load("big-frontier-model")
    samples = []
    for p in prompts:
        out = teacher.generate(p, enable_thinking=True)
        if verify(out):                          # keep only answers that pass the verifier
            samples.append({"prompt": p,
                            "reasoning": out.cot,
                            "answer": out.final})

    # 3) Supervised fine-tuning on the STUDENT.
    student = load("qwen2.5-7b")                 # or llama-3.1-8b, etc.
    sft_train(student, samples, loss="crossentropy")

    # 4) (Optional) add KL term against teacher logits if available.
    # 5) Optionally: RLHF / DPO on top. DeepSeek-R1 distillation skipped this.
Three details matter more than they sound:
- Verification before training. Synthetic data is only useful if it's right. The recent generation of distillation pipelines all add some form of reject-sampling — run the teacher many times, keep only outputs that pass a verifier (a checker for math, a compiler for code, a function-call schema check for tool use).
- Teach the reasoning, not just the answer. Including the teacher's chain-of-thought in the training data — not just the final token — is what transfers the capability, not just the answer for one prompt. This is the DeepSeek-R1 finding in one sentence.
- Mix the data. Pure distillation on one capability erodes the others. Real recipes mix general-purpose chat data with the capability you're targeting.
Case study · DeepSeek-R1 distillation
The most influential recent example, and a clean demonstration of why distillation works. DeepSeek took its large RL-trained R1 reasoning model as the teacher, started from six open-source students (Llama-3.1-8B, Llama-3.3-70B-Instruct, and Qwen-2.5 at 1.5B / 7B / 14B / 32B), generated ~800,000 high-quality reasoning traces from R1, and supervised-fine-tuned the students on those traces. No RL on the students. The released family (DeepSeek-R1-Distill-Qwen-{1.5B, 7B, 14B, 32B} and DeepSeek-R1-Distill-Llama-{8B, 70B}) is what Rapid-MLX, Ollama, and llama.cpp actually run when someone says "I'm using R1 locally."
Case study · Hermes 2 Pro / Hermes 3
Nous Research's Hermes line is the worked example of capability-targeted distillation. Hermes 3 is built on Llama 3.1 (8B / 70B / 405B), while the earlier Hermes 2 Pro shipped on Mistral 7B and Llama 3 bases; both are trained primarily on synthetically generated responses. The function-calling capability covered earlier in this wiki (the <tool_call> / <tool_response> protocol) was instilled by the openly released hermes-function-calling-v1 dataset: a mix of single-turn and multi-turn function-calling conversations, JSON-mode samples, agentic JSON-mode, and structured extraction. The Hermes 2 Pro reports show 90% on a function-calling eval built with Fireworks.AI and 84% on structured JSON output.
The lesson: the tool-call protocol described in this wiki only works because the model was trained on a dataset that uses it. The XML tags, the JSON shape, the multi-turn convention — none of it would be reliable if the model hadn't seen thousands of correctly-formatted examples during fine-tuning. Hermes is the canonical "how you teach a model to call tools" recipe, and it's why so many other model families (Mistral, Devstral, Gemma, Phi-3/4) work with the same parser in Rapid-MLX.
Brief mentions · Qwen and GPT-OSS
- Qwen / Qwen3.5: Alibaba's series uses a heavy synthetic-data + multi-stage post-training pipeline. The "thinking" variants (Qwen3.5-A3B and similar) emit reasoning blocks before answers; the chat template handles enable_thinking as a flag. Distillation from larger Qwen teachers to smaller Qwen students is part of how the small variants stay competitive.
- GPT-OSS: OpenAI's open-weight release uses the Harmony channel format and is itself a distilled student of a larger internal teacher. The Harmony parser in Rapid-MLX is the runtime counterpart to that training format.
Why this matters for Rapid-MLX (the link back)
Almost every model Rapid-MLX serves on consumer hardware is the product of a distillation pipeline followed by quantization. Two specific connections worth holding in mind:
- Compounding compression. A 4-bit DeepSeek-R1-Distill-Qwen-7B running in Rapid-MLX has been compressed twice: from 671B → 7B (distillation) and then from FP16 → INT4 (quantization). Most of the user-visible quality loss is from the first step; quantization is comparatively cheap, which is what makes "fit a frontier-quality model in 16 GB" plausible at all.
- Speculative decoding's draft model is usually a distilled sibling. When you pass --draft-model to Rapid-MLX, the right choice is almost always a small distilled variant of the same family (e.g. Qwen3.5-1.5B drafting for Qwen3.5-9B). Distillation gives the draft and target models similar token preferences, which is exactly what raises the acceptance rate that makes speculative decoding pay off.
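Why acceptance rate is the quantity that matters can be seen in a toy greedy version of speculative decoding. The "models" below are plain functions over integer tokens, invented for this sketch; a real engine verifies all drafted positions in one batched forward pass of the target, which is where the time is saved.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round of greedy speculative decoding.

    Returns (tokens appended this round, number of draft tokens accepted).
    """
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2) The target checks each proposal in order and stops at the first miss.
    emitted, ctx = [], list(context)
    for token in proposal:
        expected = target_next(ctx)
        if token != expected:
            emitted.append(expected)           # take the target's token instead
            return emitted, len(emitted) - 1
        emitted.append(token)
        ctx.append(token)

    emitted.append(target_next(ctx))           # all accepted: one bonus token
    return emitted, k


def target(ctx):
    return (ctx[-1] + 1) % 10                  # toy target: count upward


def sibling_draft(ctx):
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10   # agrees except after 7


def unrelated_draft(ctx):
    return 0                                   # almost never agrees


print(speculative_step(target, sibling_draft, [1]))
print(speculative_step(target, unrelated_draft, [1]))
```

In both cases the emitted tokens are exactly what the target alone would have produced; the draft only changes how many tokens each target pass yields, five with the well-matched sibling versus one with the unrelated draft.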
--cloud-model) earns its keep.
The contested side — frontier-lab "weight theft" claims#
A separate but adjacent topic. Several frontier labs (OpenAI, Anthropic, Microsoft) have publicly alleged that open-model labs — DeepSeek, Moonshot, MiniMax among the named — trained their models by distilling from frontier APIs in violation of those APIs' terms of service. This section walks through what's being alleged, how it would technically be accomplished, and what the public counter-arguments are. Everything below is presented as claims and disputes, not as established fact — the accused parties contest the accusations, and as a study reference this page deliberately stays balanced.
Vocabulary check
- "Weight theft" is a misnomer. No-one alleges that DeepSeek extracted the literal floating-point parameters of GPT-4 or Claude. Frontier weights have never been exposed; they couldn't be copied. What's alleged is behavioural theft: capturing the model's outputs at scale and training a student on them, so that the student inherits the teacher's behaviour without inheriting its weights.
- Distillation vs. model extraction. Academic "model extraction attacks" try to recover weights or a near-functional clone of a classifier from queries. LLM distillation is different — the goal isn't weight recovery, it's capability transfer.
- ToS violation ≠ legal violation. Whether breaching an API's terms of service rises to misappropriation, copyright infringement, or trade-secret theft is jurisdictionally unsettled and actively litigated.
How it would technically be accomplished
The same recipe as the previous section, applied without the teacher's permission. Stripped to its core:
python · pseudocode

    # 1) Acquire API access at scale. Often via proxy networks, reseller
    #    accounts, or third-party routers (OpenAI's memo to Congress alleges
    #    DeepSeek used "obfuscated routers" to circumvent access controls).
    clients = pool_of_api_keys(via="intermediaries")

    # 2) Generate diverse, capability-targeted prompts.
    #    Often a smaller open model produces the prompts to multiply scale.
    prompts = synth_prompts(seed=human_curated, expand_with="open-7b-model")

    # 3) Query the frontier API at scale; capture outputs.
    #    Reasoning models (o1, R1) expose chain-of-thought in some surfaces;
    #    capturing that CoT is what makes the distilled student strong.
    samples = []
    for p in prompts:
        r = clients.chat.completions.create(model="frontier", messages=p, ...)
        samples.append({"prompt": p,
                        "reasoning": r.message.reasoning_content,
                        "answer": r.message.content})

    # 4) Verify / reject-sample. Math checked symbolically, code by execution,
    #    function-calls by schema validation.
    samples = [s for s in samples if verify(s)]

    # 5) SFT a smaller OPEN base model on the harvested data.
    student = load("llama-3.1-8b-base")          # or qwen, etc.
    sft_train(student, samples)

    # 6) Release the student weights as "open source." Without disclosure of
    #    where the training data came from, an audit can only infer it from
    #    behavioural tells.
The thing to internalise: steps 1, 3, and 6 are the only steps that distinguish this from a legitimate research recipe. Steps 2, 4, and 5 are identical to how the openly-distributed DeepSeek-R1-Distill family was made (with R1 as the consenting teacher). The whole legal/ethical dispute is compressed into "who gave permission for step 3, and was step 1 obtained honestly."
Public examples for study
Stanford Alpaca (March 2023) — openly acknowledged
The seminal worked example. Stanford fine-tuned LLaMA-7B on 52,000 instruction-following examples generated by OpenAI's text-davinci-003, using the Self-Instruct prompt-expansion method. Total cost reportedly under $600. Stanford was transparent about the methodology and explicitly restricted the release to non-commercial research use, citing OpenAI's terms as one of the reasons (LLaMA's own licence was another). In Stanford's preliminary evaluation the model behaved qualitatively similarly to text-davinci-003, a GPT-3.5-series model. This is the canonical "API-distilled small model" recipe; everything since is a variation.
Vicuna (UC Berkeley / CMU / Stanford / UCSD, 2023) — openly acknowledged
LLaMA fine-tuned on ~70,000 user-shared ChatGPT conversations scraped from ShareGPT. Same general pattern as Alpaca, more data, more conversational. Again, methodology was published openly; the release skirted ToS by framing the work as research, not commercial deployment.
The Berkeley "False Promise of Imitating Proprietary LLMs" paper (2023) — the skeptical counterweight
A widely cited UC Berkeley paper that trained imitation models and evaluated them carefully. The headline finding: imitation models match the style of the teacher (tone, formatting, refusal patterns) far more easily than they match its capability. On hard benchmarks, the gap stays large. This is the empirical reason to be skeptical of the strongest version of the "DeepSeek just copied OpenAI" framing: if pure imitation hit a capability ceiling in 2023, the explanation for R1's actual benchmark performance has to involve more than copying.
OpenAI / Microsoft → DeepSeek (January 2025 onward) — contested
Shortly after the DeepSeek-R1 launch in January 2025, OpenAI and Microsoft publicly alleged that R1 had been trained in part on ChatGPT/o1 outputs obtained via distillation. Microsoft's security team reportedly observed unusual bulk-extraction patterns on OpenAI infrastructure tied to accounts associated with DeepSeek. In February 2026 OpenAI escalated by submitting a memo to the U.S. Congress China Select Committee alleging continued violations, including the use of "obfuscated routers" to bypass access controls. The cited evidence: stylistic resemblance between R1's reasoning traces and o1, performance trajectories the memo describes as inconsistent with pure-from-scratch training, and the API access logs above. DeepSeek has not publicly conceded the allegations; the case rests on circumstantial evidence and ToS arguments rather than seized training data.
Anthropic → DeepSeek, Moonshot, MiniMax (February 2026) — contested
Per CNBC reporting, Anthropic has alleged similar distillation activity by DeepSeek, Moonshot, and MiniMax against Claude. The accused labs dispute these characterisations. As with the OpenAI case, the public record consists of claims rather than disclosed forensic detail.
"Model claims to be ChatGPT" — the most informal evidence
The lightest-touch behavioural fingerprint: several open and quasi-open models from 2023–2025 would, when asked, identify themselves as ChatGPT, GPT-4, or similar — a strong indicator that ChatGPT-style outputs appeared in their fine-tuning data with the assistant identifying itself by name. It's not proof of large-scale unauthorised distillation by itself, but it's the kind of artefact that gets pointed at in the discourse.
Detection — how labs argue they can tell
| Method | What it can show | What it can't |
|---|---|---|
| API access auditing | Patterns of bulk querying, suspicious account chaining, IP forensics. Microsoft's case against DeepSeek-affiliated accounts is reportedly built here. | The data left over after the queries — the trained student — can't be tied back to specific calls. |
| Output watermarking | Embed a statistical signal in the teacher's token probabilities (e.g. sinusoidal perturbations detectable by Fourier transform of the suspect model's outputs). Distillation-Resistant Watermarking (DRW, EMNLP Findings 2022) claims 100% detection in lab settings. | Watermarks are removable by paraphrasing, can be spoofed, and degrade output quality if too strong. Frontier labs have not publicly confirmed deploying them at scale. |
| Behavioural fingerprints | Identity slips ("I'm ChatGPT"), refusal phrasing matching the teacher's style, specific quirks transferred wholesale. | Easily fixed in subsequent fine-tunes. Suggestive, not dispositive. |
| Stylometric / linguistic analysis | Reasoning trace structure, idiomatic phrasing, error patterns that match the teacher more than the public web. | Models trained on similar web data sound similar by default; baseline confounds the signal. |
| Output suppression | Return only top-k tokens or hard labels; reasoning tokens hidden by default (OpenAI's o1 hides CoT for exactly this reason). Forces an extractor to do many more queries. | Doesn't prevent distillation, just raises the cost. |
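The output-watermarking row can be made concrete with a toy. What follows is a minimal sketch of the general idea the table describes (a small sinusoidal perturbation, later recovered with a Fourier-style test), not the DRW algorithm itself. It assumes the teacher returns a single numeric score per query and that a student trained on those scores inherits the perturbation; `SECRET_FREQ`, `EPSILON`, `keyed_phase`, `watermark`, and `power_at` are all hypothetical names introduced for illustration.

```python
import math
import random

SECRET_FREQ = 7.0  # hypothetical secret key held by the teacher's operator
EPSILON = 0.05     # hypothetical watermark amplitude, small next to the noise

def keyed_phase(query_id: int) -> float:
    """Stand-in for a secret keyed hash: maps a query to a phase in [0, 1)."""
    return random.Random(query_id).random()

def watermark(clean_score: float, query_id: int) -> float:
    """Add a small sinusoid, indexed by the keyed phase, to the returned score."""
    angle = 2 * math.pi * SECRET_FREQ * keyed_phase(query_id)
    return clean_score + EPSILON * math.cos(angle)

def power_at(freq: float, query_ids, scores) -> float:
    """Magnitude of the Fourier component of the centred scores at `freq`."""
    mean = sum(scores) / len(scores)
    re = im = 0.0
    for qid, score in zip(query_ids, scores):
        angle = 2 * math.pi * freq * keyed_phase(qid)
        re += (score - mean) * math.cos(angle)
        im += (score - mean) * math.sin(angle)
    return math.hypot(re, im) / len(scores)

rng = random.Random(0)
query_ids = list(range(5000))

# Teacher answers carry the watermark; a student trained on them inherits it
# (modelled here, as an assumption, as the teacher's score plus extra noise).
teacher = [watermark(rng.gauss(0.5, 0.1), q) for q in query_ids]
suspect = [t + rng.gauss(0.0, 0.05) for t in teacher]
# An independently trained model has similar scores but no keyed sinusoid.
independent = [rng.gauss(0.5, 0.1) for _ in query_ids]

suspect_power = power_at(SECRET_FREQ, query_ids, suspect)
independent_power = power_at(SECRET_FREQ, query_ids, independent)
print(f"suspect: {suspect_power:.4f}  independent: {independent_power:.4f}")
```

The perturbation is half the size of the per-sample noise, yet it stands out once thousands of probes are averaged at the secret frequency. The same averaging explains the limitation in the right-hand column: anything that scrambles the relationship between query and output, such as paraphrasing, erodes the signal.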
The counter-arguments to internalise
- Technique vs. consent. The same recipe is "legitimate distillation" when the teacher consents (DeepSeek-R1 → R1-Distill-Qwen) and "alleged theft" when it doesn't. The dispute is policy, not technology.
- Frontier labs trained on copyrighted web data. A common rebuttal: every frontier model was trained on data scraped from publishers, authors, and code repositories without case-by-case consent. The same labs that argue API outputs are protected against training-data use have themselves taken expansive positions on training-data rights. Whether this is whataboutism or a substantive parallel depends on one's prior on IP norms.
- Imitation has a ceiling. The Berkeley paper and several follow-ups suggest that pure black-box distillation copies surface and falls behind on hard capability. If DeepSeek-R1 actually performs at the level reported, that performance probably isn't entirely attributable to copying — even granting the strongest version of the accusation.
- Detection evidence is mostly circumstantial. No public allegation against an open lab has, to date, presented disclosed forensic artefacts — no watermark match, no exfiltrated training-data file. The cases rest on access patterns, behaviour, and capability curves, each of which has innocent alternative explanations.
- The asymmetry is real, even granting the rebuttals. A lab that builds a frontier model spends billions on compute, RLHF, and red-teaming. A lab that distills its outputs for a few million spends much less and produces a near-substitute. Whatever one thinks of the IP framing, the economics of that gap are why frontier labs view it as existential, and why the accusations keep getting made.
Code-level appendix#
A · The Hermes recursive tool-call loop (annotated)
From NousResearch/Hermes-Function-Calling/functioncall.py. The full loop is <200 lines. Below: the canonical structure.
python · functioncall.py (excerpted)
class ModelInference:
    def generate_function_call(self, query, chat_template, num_fewshot, max_depth=5):
        depth = 0
        chat = [{"role": "user", "content": query + " (first turn; no <tool_results> yet)"}]
        tools = functions.get_openai_tools()
        prompt = self.prompter.generate_prompt(chat, tools, num_fewshot)
        completion = self.run_inference(prompt)

        def recursive_loop(prompt, completion, depth):
            tool_calls, assistant_msg, err = self.process_completion_and_validate(
                completion, chat_template)
            prompt.append({"role": "assistant", "content": assistant_msg})
            if tool_calls:
                tool_message = f"Agent iteration {depth}..."
                for call in tool_calls:
                    valid, why = validate_function_call_schema(call, tools)
                    if valid:
                        try:
                            resp = self.execute_function_call(call)
                            tool_message += f"<tool_response>\n{resp}\n</tool_response>\n"
                        except Exception as e:
                            tool_message += format_error_for_model(call, e)
                    else:
                        tool_message += format_schema_error(call, why)
                prompt.append({"role": "tool", "content": tool_message})
                depth += 1
                if depth >= max_depth:
                    return
                completion = self.run_inference(prompt)
                recursive_loop(prompt, completion, depth)
            elif err:
                # model produced a malformed tool call; feed the parse error back
                prompt.append({"role": "tool", "content": format_parser_error(err)})
                ...
            else:
                return  # pure content → done

        recursive_loop(prompt, completion, depth)
Three things to notice: the loop feeds parse errors back to the model as tool responses (self-correction), it has a hard max_depth (5) to prevent runaway, and every tool result is wrapped in <tool_response> tags so the next inference sees a clean role-tagged context.
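The excerpt leans on helpers it does not show. As a rough illustration of what the parse, validate, and wrap steps involve, here is a minimal sketch assuming the Hermes wire format of one JSON object per `<tool_call>` block and OpenAI-style tool schemas. The function names `parse_tool_calls`, `validate_call`, and `wrap_tool_response` are hypothetical and are not the repository's functions; the validation is deliberately simpler than a full JSON Schema check.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def parse_tool_calls(completion):
    """Return (calls, error). `error` is a message if any block is not valid JSON."""
    calls = []
    for raw in TOOL_CALL_RE.findall(completion):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError as exc:
            return [], f"could not parse tool call: {exc}"
    return calls, None

def validate_call(call, tools):
    """Check that the call names a known tool and supplies its required arguments."""
    if not isinstance(call, dict) or "name" not in call:
        return False, "tool call must be an object with a 'name'"
    spec = next((t["function"] for t in tools
                 if t["function"]["name"] == call["name"]), None)
    if spec is None:
        return False, f"unknown tool {call['name']!r}"
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return False, "'arguments' must be an object"
    required = spec.get("parameters", {}).get("required", [])
    missing = [key for key in required if key not in args]
    if missing:
        return False, f"missing required arguments: {missing}"
    return True, ""

def wrap_tool_response(name, content):
    """Wrap a tool result in the tags the next inference pass expects to see."""
    payload = json.dumps({"name": name, "content": content})
    return f"<tool_response>\n{payload}\n</tool_response>\n"

# Hypothetical tool schema and completions, for illustration only.
tools = [{"type": "function", "function": {
    "name": "get_weather",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]}}}]

good = ('Checking.\n<tool_call>\n'
        '{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>')
bad = '<tool_call>{"name": "get_weather", "arguments": </tool_call>'

calls, err = parse_tool_calls(good)
ok, why = validate_call(calls[0], tools)
bad_calls, bad_err = parse_tool_calls(bad)
print(calls, ok, bad_err)
```

Returning the parse error as a string, rather than raising, is what makes the self-correction path possible: the caller can hand the message straight back to the model as a tool-role turn.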
B · mlx-lm streaming generate (shape)
python · mlx-lm API surface
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("mlx-community/Qwen3.5-9B-4bit")
sampler = make_sampler(temp=0.7, top_p=0.9)
processors = make_logits_processors(repetition_penalty=1.05)

def stream_events(messages, tools_schema, parser):
    # `parser` is a placeholder for a streaming Hermes parser exposing
    # feed() / drain_events(); it is not part of mlx-lm.
    prompt = tokenizer.apply_chat_template(
        messages,  # e.g. [{"role": "user", "content": "…"}], passed positionally
        tools=tools_schema,
        add_generation_prompt=True,
    )
    for response in stream_generate(
        model, tokenizer, prompt,
        max_tokens=2048,
        sampler=sampler,
        logits_processors=processors,
    ):
        parser.feed(response.text)   # streaming Hermes parser
        yield parser.drain_events()  # content + tool-call deltas
C · MLX lazy evaluation in three lines
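python · MLX
import mlx.core as mx
a = mx.random.uniform(shape=(1024, 1024))  # nothing computed yet
b = mx.matmul(a, a) + a                    # still just a graph node
mx.eval(b)                                 # NOW the kernels run
The whole performance story of MLX hinges on this. By the time mx.eval runs, MLX has the full graph, so it can skip work whose results are never used and apply function transforms to the graph before anything executes. Kernel fusion is a separate step: per the MLX documentation it happens when a function is wrapped in mx.compile, not on a bare mx.eval.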
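To see deferred evaluation without needing an Apple Silicon machine, here is a toy model of the idea in plain Python. It is a sketch of the concept only and says nothing about how MLX is implemented: the `Lazy` class, its `computed` counter, and its `eval` method are all hypothetical. Arithmetic builds graph nodes, and nothing is computed until `eval` is called on the node you actually want.

```python
class Lazy:
    """A node in a toy compute graph: stores an operation and its inputs."""

    computed = 0  # how many nodes have actually been evaluated so far

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs
        self._value = None
        self._done = False

    @staticmethod
    def const(value):
        return Lazy(lambda: value)

    def __add__(self, other):
        return Lazy(lambda x, y: x + y, self, other)

    def __mul__(self, other):
        return Lazy(lambda x, y: x * y, self, other)

    def eval(self):
        """Evaluate this node's inputs on demand, then the node itself, once."""
        if not self._done:
            args = [node.eval() for node in self.inputs]
            self._value = self.fn(*args)
            self._done = True
            Lazy.computed += 1
        return self._value

a = Lazy.const(3)
b = a * a + a                  # builds three graph nodes, runs nothing
unused = a * Lazy.const(1000)  # recorded, but never needed
before = Lazy.computed         # still 0: no node has been evaluated

result = b.eval()              # only the nodes b depends on are computed
after = Lazy.computed
print(result, before, after)
```

The `unused` branch costs nothing because no one ever asks for its value, which is the "only compute what you use" half of the lazy-evaluation argument. The other half, rewriting the recorded graph before running it, is what the real framework's transforms build on.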
Glossary#
| Term | Meaning here |
|---|---|
| Harness | The server-side machinery that wraps a model: prompt building, generation loop, parsing, HTTP. In this stack, Rapid-MLX. |
| Orchestrator | The client-side loop that decides when to call the model, when to execute tools, and what to do with results. Usually one of Claude Code / Aider / Cursor / your script. |
| Agent | An LLM + tools + an orchestrator running the loop. "Agentic" = the loop has more than one turn and at least one tool execution. |
| Chat template | A Jinja file shipped with the model that converts messages + tools into the exact token string the model was trained on. |
| Prefill | Processing the prompt tokens to build the KV cache, before the first generated token. |
| Decode | Generating one token at a time, autoregressively. |
| TTFT | Time-To-First-Token. Latency between request and first decoded token. Dominated by prefill. |
| KV cache | Per-layer key/value tensors saved across decode steps so each new token doesn't redo attention over the whole history. |
| DeltaNet | An RNN-style attention replacement used in Qwen3.5 hybrid models. Stateful; not slice-trimmable like KV. |
| Speculative decoding | A small "draft" model proposes tokens; the main model verifies them in parallel. 1.5–6× decode speedup when the draft is well-aligned. |
| MCP | Model Context Protocol — a standard for letting servers advertise tools to clients. Rapid-MLX supports it via --mcp-config. |
Sources#
- Rapid-MLX · raullenchai/Rapid-MLX (GitHub): README, architecture diagram, parser table, flag reference, benchmark methodology.
- MLX · ml-explore/mlx (GitHub): README enumerating lazy eval, unified memory, dynamic graphs, transforms.
- mlx-lm · ml-explore/mlx-lm (GitHub): load / generate / stream_generate API, samplers & logits processors, prompt cache, KV cache.
- Hermes-Function-Calling · NousResearch (GitHub): system prompt template, tool-call XML format, recursive loop.
- functioncall.py: ModelInference class, recursive_loop, execute_function_call, max_depth.
- hermes-function-calling-v1 dataset: training data shape that taught the model the format.
- waybarrios/vllm-mlx: upstream of Rapid-MLX.
- MLX documentation: quick start, transforms, multi-device.
- DeepSeek-R1-Distill-Qwen-1.5B (Hugging Face): distilled student family, model card and config.
- Knowledge Distillation Using Frontier Open-source LLMs (arXiv 2410.18588): Llama-3.1-405B → 8B/70B with synthetic data; the recent reference for black-box distillation.
- Hermes 3 (Nous Research): Hermes-3 announcement and training details on Llama 3.1 base models.
- Hermes-2-Pro-Llama-3-8B (Hugging Face): model card covering the function-calling and JSON-mode dataset and eval scores.
- The Complete Guide to DeepSeek Models (BentoML): methodology summary: 800k samples, 6 students, SFT-only recipe.
- Stanford Alpaca: the canonical openly-acknowledged API-distilled small model.
- The False Promise of Imitating Proprietary LLMs (UC Berkeley, arXiv 2305.15717): the empirical case that imitation copies style more readily than capability.
- OpenAI vs DeepSeek distillation dispute (Rest of World): overview of the public allegations and counter-claims.
- OpenAI memo to Congress (FDD analysis): coverage of the February 2026 memo, including the "obfuscated routers" allegation.
- Anthropic accuses DeepSeek, Moonshot, MiniMax (CNBC): public reporting on additional accusations against open labs.
- Distillation-Resistant Watermarking (EMNLP Findings 2022): the technical method behind output watermarking and Fourier-based detection.
Built as a study reference. Nothing in this page is private to any company; every claim should trace back to a public source above. If something here disagrees with what you observe in code, the code is right — file an issue against your own notes.