---
compiled: 2025.12.06
depth: 5 min_read
---

tool calls are structured output

Why tool calling is just JSON prediction and how to build reliable 'agentic' loops.

If you listen to AI marketing, “tool calling” (or function calling) sounds like a major architectural breakthrough. It’s described as a way for the model to “access the real world,” “use your API,” or “execute code.”

It sounds like the model has a tiny hand that reaches out of the terminal and clicks a button.

But if you’ve been following this series, you know the truth: LLMs only move in one direction, forward, one token at a time.

Tool calling isn’t a special “mode.” It’s just a specialized form of structured text generation.

The “Assistant” Magic Trick

When you use a tool-calling API, the model isn’t actually executing anything. It’s playing a game of “Simon says” where your code is Simon.

The “magic” happens in the client-side code (the Python or TypeScript SDK you’re using). The model just predicts the JSON that it thinks Simon wants to see.

The Raw Reality

To understand tool calling, you have to look at the raw bytes coming back from the API. The model doesn’t return a “function call object” in some binary format. It returns text that looks like JSON.

Here is what a raw response might look like when you ask a model to “get the weather in New York”:

raw_api_response.json
{
  "id": "chatcmpl-123",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"New York, NY\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}

Notice three things:

  1. content is null. The model stopped generating “human” text.
  2. tool_calls contains the predicted JSON.
  3. finish_reason is "tool_calls".

The model predicted a stop token specifically for tool calls. It didn’t “click” anything. It just stopped talking and handed the JSON to the client.
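You can verify the “it’s just text” claim in a few lines. A minimal sketch in Python, assuming the payload above is saved as raw_api_response.json: notice that arguments is itself a string of JSON that your code has to parse a second time.

parse_tool_call.py
import json

# The payload shown above, exactly as it came back over the wire.
response = json.loads(open("raw_api_response.json").read())

tool_call = response["choices"][0]["message"]["tool_calls"][0]

# "arguments" is not an object. It is a string the model predicted token by token,
# so it needs its own json.loads before your code can use it.
args = json.loads(tool_call["function"]["arguments"])
print(tool_call["function"]["name"], args["location"])  # get_weather New York, NY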

The Mirror Loop

Building an “agent” is just running a loop that mirrors this structure. It’s a four-step dance (sketched in code after the list):

  1. The Request: You send the model a list of tools (schemas) and a prompt.
  2. The Prediction: The model predicts a tool_call (JSON).
  3. The Execution: Your code parses that JSON, runs a local function, and gets a result.
  4. The Feedback: You send that result back to the model as a new message.
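Here is a minimal sketch of that loop using the openai Python SDK. The get_weather function, the tool schema, and the model name are placeholders; other providers use the same shape with different field names.

agent_loop.py
import json
from openai import OpenAI

client = OpenAI()

# A stand-in for a real weather API.
def get_weather(location: str) -> str:
    return f"72°F and sunny in {location}"

# Step 1: the tools (schemas) sent with every request.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in New York?"}]

while True:
    # Step 2: the prediction. The model either answers or emits tool-call JSON.
    response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    message = response.choices[0].message

    if not message.tool_calls:
        print(message.content)  # plain "human" text: the loop is done
        break

    messages.append(message)  # keep the predicted tool call in the transcript
    for call in message.tool_calls:
        # Step 3: the execution. Parse the predicted JSON, run a local function.
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        # Step 4: the feedback. The result goes back to the model as a new message.
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

That is the whole “agent”: a while loop, a JSON parser, and your own functions. The model never executes anything.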

Hallucinating in JSON

Because tool calling is just prediction, models can hallucinate tools just as easily as they hallucinate facts.

In 2026, models are actually quite good at picking the right tool, as long as you keep the number of tools reasonable (e.g., OpenAI has a limit of 128, but accuracy degrades long before that).

The failure mode has shifted. They don’t usually invent a make_sandwich tool when you don’t have one.

Instead, they fail on the arguments (see the validation sketch after this list).

  • Subtle Types: Predicting a string "true" instead of a boolean true.
  • Hallucinated Parameters: Inventing a date parameter because the user mentioned “tomorrow,” even though your schema only has location.
  • Reasoning-Argument Mismatch: The model “thinks” correctly (e.g., “I need to calculate the mortgage”) but then outputs 0 for the interest rate because it wasn’t in the context.
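The cheap defense is to validate the predicted arguments against your schema before executing anything, and to send validation errors back as the tool “result” so the model can retry. A minimal sketch using the jsonschema package; the schema is illustrative.

validate_arguments.py
import json
from jsonschema import Draft202012Validator

WEATHER_SCHEMA = {
    "type": "object",
    "properties": {"location": {"type": "string"}},
    "required": ["location"],
    "additionalProperties": False,  # rejects hallucinated parameters like "date"
}

def check_arguments(raw_arguments: str) -> dict:
    args = json.loads(raw_arguments)  # the model handed us a string, not an object
    errors = list(Draft202012Validator(WEATHER_SCHEMA).iter_errors(args))
    if errors:
        # Don't execute. Feed these messages back as the tool result and let the model retry.
        raise ValueError("; ".join(e.message for e in errors))
    return args

check_arguments('{"location": "New York, NY"}')  # ok
try:
    check_arguments('{"location": "New York", "date": "tomorrow"}')
except ValueError as err:
    print("rejected:", err)  # the hallucinated "date" parameter never reaches your code

The subtle type failures fall out of the same pattern: declare "type": "boolean" in the schema and the string "true" gets flagged instead of silently passed along.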

The Context Tax

There is no free lunch. Every tool you define is injected into the system prompt, consuming your context window (the sketch after the list below puts a number on it).

If you dump your entire API spec (300 endpoints) into the tool definition, you are:

  1. Paying for those tokens on every single request.
  2. Confusing the model with too many choices.
  3. Pushing actual conversation history out of the window.
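You can estimate the tax with a tokenizer. A rough sketch using tiktoken: the exact way providers format tool definitions into the prompt is not public, and the o200k_base encoding is an assumption, so treat the result as a lower bound.

tool_context_cost.py
import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by recent OpenAI models

# One tool definition, in the same JSON you send to the API.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

per_tool = len(enc.encode(json.dumps(weather_tool)))
print(f"~{per_tool} tokens per tool; 300 endpoints ≈ {300 * per_tool} tokens on every request")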

You Don’t Even Need the API

Before OpenAI and Anthropic released official “function calling” features, we did this manually. We would prompt the model: “If you need to search, output JSON in the format { "action": "search", "query": "..." }.”

In practice, this often requires adding a unique delimiter (like <tool_call>... </tool_call>) to make it easy for your client-side parser to find the “computer parts” in the middle of a “human” sentence.
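Here is what that looks like end to end. The prompt text, the delimiter, and the parser are all conventions you invent yourself; nothing in this sketch is an API feature.

manual_tool_calling.py
import json
import re

# The "tool definition" is just prose in the prompt.
SYSTEM_PROMPT = (
    "You are a helpful assistant. If you need to search, output exactly:\n"
    '<tool_call>{ "action": "search", "query": "..." }</tool_call>\n'
    "Otherwise, answer normally."
)

def extract_tool_call(model_output: str):
    """Find the 'computer parts' in the middle of a 'human' sentence."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", model_output, re.DOTALL)
    if match is None:
        return None  # the model just talked; nothing to execute
    return json.loads(match.group(1))

# A plausible completion from a model prompted with SYSTEM_PROMPT:
output = 'Let me check. <tool_call>{ "action": "search", "query": "weather New York" }</tool_call>'
print(extract_tool_call(output))  # {'action': 'search', 'query': 'weather New York'}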

The official APIs just make this more reliable by training the model on the specific <tool_call> token boundaries and providing better constrained sampling.

The Takeaway

When you are debugging a tool-calling failure, don’t look for “bugs in the model’s logic.” Look for:

  1. Ambiguous Descriptions: Is the model confused about which tool to pick?
  2. Schema Complexity: Is your JSON schema too deep for the model to predict reliably?
  3. Context Overflow: Is the result of the tool call too long, pushing earlier instructions out of the window? (See context is everything you send)

Tool calling is the bridge between the probabilistic world of LLMs and the deterministic world of software. But the bridge is made of text.

Next up: The final post in this series. We’ve spent six posts looking at how LLMs work. Now, we’ll look at why they “fail”, and why hallucination is actually the engine’s core architecture.

Footnotes

  1. See OpenAI’s tool call spec for a common implementation of this contract.