LLM APIs in Practice

Chat completions, streaming, tokens, prompt caching.

An LLM API is just a fancy HTTP endpoint. You send it some text, it sends back some text. That's the whole magic trick. Everything else (chat history, streaming, tool use, caching) is just plumbing around that one idea. Let's build the right mental model so the docs stop feeling like wizardry.

One-shot vs chat completions

The earliest LLM APIs were completions: you sent a raw prompt string and got a continuation back. Modern APIs are chat completions: you send an array of messages, each tagged with a role.

system sets the personality and rules ("You are a helpful tutor. Always answer in two sentences.").
user is what the human said.
assistant is what the model previously said. You include past assistant turns so the model remembers the conversation.

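The model has no memory between requests, so you pass the whole transcript every time. Think of the LLM as a function of its input: everything it knows about the conversation is in the request you just sent. The output is not strictly repeatable, though, because sampling adds randomness from call to call.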

basic-chat.ts

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const res = await client.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 1024,
  system: "You are a friendly JavaScript tutor.",
  messages: [
    { role: "user", content: "What is hoisting in one sentence?" },
  ],
});

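// content is an array of blocks; only text blocks carry a .text field,
// so narrow the type before reading it.
const first = res.content[0];
if (first.type === "text") {
  console.log(first.text);
}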
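Because the model is stateless, a multi-turn chat is just an array that grows. Here is a minimal sketch of that bookkeeping, with the actual API call left out so it runs offline. The names `ChatMessage`, `addTurn`, and `transcript` are made up for illustration, and the assistant reply is a hand-written placeholder, not real model output.

```typescript
type Role = "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Returns a new transcript with one more turn; the input array is not mutated.
function addTurn(
  past: readonly ChatMessage[],
  role: Role,
  content: string,
): ChatMessage[] {
  return [...past, { role, content }];
}

// Turn 1: the request carries a single user message.
let transcript: ChatMessage[] = addTurn([], "user", "What is hoisting in one sentence?");

// ...send `transcript` as `messages`, then record what came back.
// (Placeholder reply text, written by hand for this sketch.)
transcript = addTurn(
  transcript,
  "assistant",
  "Hoisting moves declarations to the top of their scope.",
);

// Turn 2: the next request carries all three messages, not just the new one.
transcript = addTurn(transcript, "user", "Does that apply to let and const?");

console.log(transcript.length); // 3
```

In a real app, the array you pass as `messages` in basic-chat.ts would be this `transcript`, and the request gets a little bigger with every turn.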

Roles, not authors

The user role does not have to be a real human. If you wrap an LLM call in an API, your server is putting words in the user role on behalf of whoever called it. The roles are just labels the model uses to disambiguate.
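To make that concrete, here is a rough sketch of a server-side helper that turns a caller's raw text into a request body. Everything in it (`SummaryRequest`, `buildSummaryRequest`, the prompt wording) is hypothetical; the point is only that your code, not a human, writes the user message.

```typescript
interface OutgoingMessage {
  role: "user" | "assistant";
  content: string;
}

interface SummaryRequest {
  system: string;
  messages: OutgoingMessage[];
}

// Hypothetical helper for a "summarize this article" endpoint.
// The caller only supplies articleText; the server wraps it in a user turn.
function buildSummaryRequest(articleText: string): SummaryRequest {
  return {
    system: "You summarize articles in three bullet points.",
    messages: [
      {
        role: "user",
        content: `Summarize the following article:\n\n${articleText}`,
      },
    ],
  };
}

const request = buildSummaryRequest("TCP guarantees ordered delivery.");
console.log(request.messages[0].role); // user
```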

Tokens: the currency of LLMs

Models do not see characters or words. They see tokens, which average roughly 3 to 4 characters of English text each. A short, common word is often a single token, while a long or rare word gets split into several pieces. The exact split depends on the model's tokenizer. Punctuation, whitespace, and emoji all count.

Every model has a context window: the maximum number of input + output tokens it can handle in a single call. Claude Sonnet 4.5 handles 200K tokens, with a 1M token beta for longer inputs. GPT models range from 128K to 1M depending on the variant. The window is hard: hit it and the API rejects you.

Output tokens cost more than input tokens (often 4 to 5x). Generating is more expensive than reading.
Long system prompts get expensive at scale because you pay for them on every single call.
Truncate or summarize old chat history before the window fills up. Otherwise the API just errors out.
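The truncation advice above can be sketched as a small helper. This version assumes the rough "about 4 characters per token" rule from earlier instead of a real tokenizer, so treat its numbers as a budget estimate only. `HistoryTurn`, `estimateTokens`, and `trimToBudget` are illustrative names, not part of any SDK.

```typescript
interface HistoryTurn {
  role: "user" | "assistant";
  content: string;
}

// Rough heuristic: about 4 characters per token.
// A real tokenizer will give different counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps the most recent turns whose estimated total fits within `budget`.
function trimToBudget(turns: readonly HistoryTurn[], budget: number): HistoryTurn[] {
  const kept: HistoryTurn[] = [];
  let used = 0;
  // Walk backwards so the newest turns are kept first.
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimateTokens(turns[i].content);
    if (used + cost > budget) break;
    kept.unshift(turns[i]);
    used += cost;
  }
  // Assumption: the API expects the transcript to open with a user turn,
  // so drop any leading assistant turns left over after trimming.
  while (kept.length > 0 && kept[0].role !== "user") {
    kept.shift();
  }
  return kept;
}

const longHistory: HistoryTurn[] = [
  { role: "user", content: "a".repeat(400) },      // ~100 tokens
  { role: "assistant", content: "b".repeat(400) }, // ~100 tokens
  { role: "user", content: "c".repeat(200) },      // ~50 tokens
  { role: "assistant", content: "d".repeat(200) }, // ~50 tokens
  { role: "user", content: "e".repeat(40) },       // ~10 tokens
];

const trimmed = trimToBudget(longHistory, 120);
console.log(trimmed.length); // 3
```

Dropping old turns is the bluntest option. Summarizing them into one short message keeps more context for the same budget, at the cost of an extra model call.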

Count before you call

Most SDKs expose a token counter. Run it on your prompt during development so you know your real per-call cost, not the "works on my laptop" cost.

Streaming: text as it's generated

Without streaming, you wait for the entire response, then get a JSON blob. For a 500-word answer that can be 10+ seconds of staring at a spinner. With streaming, the server sends tokens as they are generated, so your UI can render them one chunk at a time. This is what makes ChatGPT feel snappy.

Under the hood, providers use Server-Sent Events (SSE) or a similar long-lived HTTP response. Each chunk is a small JSON delta. You concatenate them on the client.

streaming.ts

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const stream = client.messages.stream({
  model: "claude-opus-4-5",
  max_tokens: 1024,
  messages: [
    { role: "user", content: "Write a haiku about TCP packets." },
  ],
});

for await (const event of stream) {
  if (
    event.type === "content_block_delta" &&
    event.delta.type === "text_delta"
  ) {
    process.stdout.write(event.delta.text);
  }
}

const final = await stream.finalMessage();
console.log("\nstop reason:", final.stop_reason);

Browsers can consume the same stream using the Fetch API and a ReadableStream. The official SDKs (@anthropic-ai/sdk, openai) hide the byte-level details and give you an async iterator like the one above.
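If you are curious what the SDK is hiding, here is a simplified sketch of the "concatenate the deltas" step. It assumes the whole SSE body is already in memory as a string and that every `data:` line holds JSON shaped like the events in streaming.ts. A real client would decode and buffer chunks from the ReadableStream as they arrive, and other providers use different event shapes. `collectText` and the sample payload are made up for illustration.

```typescript
// Extracts and joins the text deltas from a raw SSE body.
function collectText(sseBody: string): string {
  let text = "";
  for (const line of sseBody.split(/\r?\n/)) {
    // Only "data:" lines carry payloads; "event:" lines and blanks are skipped.
    if (!line.startsWith("data:")) continue;
    const payload = JSON.parse(line.slice("data:".length).trim());
    if (
      payload.type === "content_block_delta" &&
      payload.delta?.type === "text_delta"
    ) {
      text += payload.delta.text;
    }
  }
  return text;
}

// Hand-written sample in the shape of the events handled in streaming.ts.
const sampleBody = [
  "event: content_block_delta",
  'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Packets "}}',
  "",
  "event: content_block_delta",
  'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"drift."}}',
  "",
  "event: message_stop",
  'data: {"type":"message_stop"}',
  "",
].join("\n");

console.log(collectText(sampleBody)); // Packets drift.
```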

Prompt caching: pay once for the boring parts

Real apps have huge system prompts. A coding assistant might prepend 20 pages of style guide, internal docs, and few-shot examples to every single turn. That gets very expensive very fast.

Prompt caching lets you mark a prefix as cacheable. The provider stores it server-side for a few minutes. Subsequent requests that reuse the same prefix pay roughly 10% of the input cost for that portion (Anthropic charges 1.25x to write and 0.1x to read). On a 50,000-token system prompt, the difference is dramatic.

caching.ts

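import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Placeholders so the snippet compiles. In a real app these hold the
// ~50,000-token handbook text and the caller's question.
const longCompanyHandbook = "...";
const userQuestion = "How many vacation days do new hires get?";

const res = await client.messages.create({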
  model: "claude-opus-4-5",
  max_tokens: 512,
  system: [
    {
      type: "text",
      text: longCompanyHandbook, // 50,000 tokens
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userQuestion }],
});
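The pricing multipliers above are easier to appreciate as arithmetic. This sketch measures cost in "input-token units" (1 unit is the normal price of one input token) and assumes the best case: the first call writes the cache and every later call hits it. `prefixCost` is an illustrative helper, not an SDK function.

```typescript
// Multipliers quoted above for Anthropic: 1.25x to write, 0.1x to read.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Cost of sending the same prefix on `calls` requests, in input-token units.
// Assumes one cache write followed by (calls - 1) cache hits.
function prefixCost(prefixTokens: number, calls: number, cached: boolean): number {
  if (calls <= 0) return 0;
  if (!cached) return prefixTokens * calls;
  const write = prefixTokens * CACHE_WRITE_MULTIPLIER;
  const reads = prefixTokens * CACHE_READ_MULTIPLIER * (calls - 1);
  return write + reads;
}

// A 50,000-token system prompt sent on 100 calls.
const uncached = prefixCost(50000, 100, false); // 5,000,000 units
const withCache = prefixCost(50000, 100, true); // 62,500 + 495,000 = 557,500 units

console.log((withCache / uncached).toFixed(2)); // 0.11
```

Under these assumptions the cached prefix costs about 11% of the uncached total. With a single call, caching is the more expensive option, since you pay the 1.25x write and never read it back.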

Cache hits are best effort

Cache entries expire after about 5 minutes of inactivity. If your app has bursty traffic, the first request after a quiet period misses the cache and pays the write price again, which is slightly more than the normal input price. Design for the cache, but do not bet your margins on a 100% hit rate.

Sampling knobs you'll touch

temperature (usually 0 to 1): higher means more random and creative output. Use 0 for code, fact extraction, or anything where you want the most repeatable output. Even at 0, responses are not guaranteed to be identical from call to call. Use 0.7+ for brainstorming and prose.
max_tokens: ceiling on the output. The model stops here even mid-sentence. Set it generously but not infinitely.
stop_sequences: strings that, when generated, cause the model to stop. Handy when you want output to end at a specific delimiter.
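For intuition about what temperature does, here is the textbook formulation: divide the model's raw scores (logits) by the temperature before turning them into probabilities. Providers do not publish their exact sampling code, so take this as a sketch of the general idea, with made-up scores for three candidate tokens.

```typescript
// Converts raw scores into probabilities, scaled by temperature.
// Requires temperature > 0; a temperature of 0 is conventionally handled
// as "always pick the highest-scoring token" instead of dividing by zero.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) {
    throw new RangeError("temperature must be > 0");
  }
  const scaled = logits.map((x) => x / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((x) => Math.exp(x - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Made-up scores for three candidate next tokens.
const exampleLogits = [2.0, 1.0, 0.0];

const sharp = softmaxWithTemperature(exampleLogits, 0.2); // top token dominates
const flat = softmaxWithTemperature(exampleLogits, 1.0);  // probability is spread out

console.log(sharp[0] > flat[0]); // true
```

At low temperature almost all the probability lands on the top-scoring token, which is why output becomes more repeatable. At higher temperature the runners-up get a real chance of being picked.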

Quick quiz

Why do chat APIs require you to send the entire message history every time?

Recap

Chat APIs take a messages array of system / user / assistant turns. The model is stateless.
Tokens are the billing unit. Output costs more than input. Context windows are hard caps.
Streaming uses SSE-style chunks so the UI updates as text is generated.
Prompt caching cuts cost dramatically when a long prefix is reused across calls.
temperature controls randomness. Low for code, high for prose.