AI Token Counting: The Complete Guide — GPT / Claude / Gemini, Context Limits, Cost Saving

"Does my prompt fit?" "How much will it cost?" "Will I blow the context window?" — three questions every AI app developer has been asked by their PM. But OpenAI / Anthropic / Google all tokenize differently, and Chinese vs English tokens differ even more. This guide uses our `token-counter` tool as the worked example to break down real token-estimation logic, the cost comparison across 5 leading models, and 3 practical cost-saving techniques.

Tokens aren't words or characters — they're BPE subwords

Tokenizers typically use BPE (Byte Pair Encoding) — frequently co-occurring subwords are merged into single tokens.

"hello" = 1 token
"unhappiness" = un + happiness = 2 tokens
"你好嗎" = 你 + 好 + 嗎 = 3 tokens (one per character)
"中華電信" = 中華 + 電信 = 2 tokens (depends on tokenizer corpus)

So Chinese is roughly 1–2 tokens per character, English is roughly 1.3 tokens per word, Japanese / Korean fall between. When sizing prompts, multiplying word count × 1.5 gives a safe upper bound — better to over-estimate than blow the context window.

5 model cost comparison (2026 real pricing)

Assume a 5000-token prompt + 500-token reply per call. Cost per 1000 calls:

GPT-4o ($2.5/M input, $10/M output): $12.5 + $5 = $17.5
Claude 3.5 Sonnet ($3/M, $15/M): $15 + $7.5 = $22.5
Gemini 1.5 Pro ($1.25/M, $5/M): $6.25 + $2.5 = $8.75
GPT-4o-mini ($0.15/M, $0.6/M): $0.75 + $0.3 = $1.05
Claude 3.5 Haiku ($0.8/M, $4/M): $4 + $2 = $6

Takeaway: Gemini Pro is ~60% cheaper than Claude Sonnet for the same task. But quality ranks Claude > GPT-4o > Gemini on complex reasoning. Cheaper ≠ cheaper-in-practice — debugging a wrong answer can cost more than 1000 runs.

What happens when context fills up? Three failure modes

Context window is the maximum token span the model can "see": GPT-4o = 128K, Claude 3.5 = 200K, Gemini 1.5 = 1M+.

Exceed it and you hit one of three modes:

Hard error (OpenAI, Anthropic): API returns 400 context_length_exceeded
Head truncation (some legacy / wrapper layers): earlier conversation silently gets dropped, the model "forgets" what you said
Quality collapse (close to the ceiling): not exceeded yet, but the model's attention to earlier content degrades — hallucination, ignored instructions

Rule of thumb: at 80% of the window, actively compress. Don't wait for the error. For long sessions, use summarize-and-discard: every 10 turns, ask the model to summarize the previous 10 turns and replace the originals.

3 practical cost-saving techniques

1. Move few-shot examples to system prompt + prompt caching Claude and OpenAI both support prompt caching — repeated system prompts hit cache, cost drops to ~1/10. Stuff all few-shot examples into the system prompt; user prompt carries only the task at hand.

2. Use JSON over natural language for structured data "The user's name is Mark Liu, age 32, an engineer" = ~15 tokens {"name":"Mark Liu","age":32,"job":"engineer"} = ~12 tokens

JSON looks heavier, but BPE optimizes { : " heavily — actual density is higher.

3. Use GPT-4o-mini for routing / classification, escalate hard cases Over 70% of LLM tasks are classification / extraction / formatting — GPT-4o-mini and Claude Haiku handle them just fine. Treat them as Layer 1, only escalate complex reasoning to large models. Average cost drops by ~80%.

Why this tool uses a heuristic instead of a real tokenizer

OpenAI's tiktoken, Anthropic's tokenizer, etc — each requires loading a ~1 MB vocab dictionary + ~500 KB WASM runtime. For an estimation tool, adding 1.5 MB of bundle just for ±2% precision isn't worth it.

Our heuristic:

Chinese / CJK chars: 1.5 tokens each (averaged from 1–2)
English words: BPE-style subword splitting, ~1.3 tokens per word
Punctuation / numbers / emoji: 1 token each
Code: identify common keywords, compress estimate

Typical error ±10% — enough for budget estimation and context-fit checks. For exact billing, the API response's usage.total_tokens is the source of truth.

Next time you're picking an AI model or launching a feature, run a quick check with the AI Token Counter — see your prompt cost across 5 models. The estimation runs in the browser — your prompt never leaves the machine.