"Does my prompt fit?" "How much will it cost?" "Will I blow the context window?" — three questions every AI app developer has been asked by their PM. But OpenAI / Anthropic / Google all tokenize differently, and Chinese vs English tokens differ even more. This guide uses our `token-counter` tool as the worked example to break down real token-estimation logic, the cost comparison across 5 leading models, and 3 practical cost-saving techniques.
Tokens aren't words or characters — they're BPE subwords
Tokenizers typically use BPE (Byte Pair Encoding) — frequently co-occurring subwords are merged into single tokens.
"hello"= 1 token"unhappiness"=un+happiness= 2 tokens"你好嗎"=你+好+嗎= 3 tokens (one per character)"中華電信"=中華+電信= 2 tokens (depends on tokenizer corpus)
So Chinese is roughly 1–2 tokens per character, English is roughly 1.3 tokens per word, Japanese / Korean fall between. When sizing prompts, multiplying word count × 1.5 gives a safe upper bound — better to over-estimate than blow the context window.
5 model cost comparison (2026 real pricing)
Assume a 5000-token prompt + 500-token reply per call. Cost per 1000 calls:
- GPT-4o ($2.5/M input, $10/M output): $12.5 + $5 = $17.5
- Claude 3.5 Sonnet ($3/M, $15/M): $15 + $7.5 = $22.5
- Gemini 1.5 Pro ($1.25/M, $5/M): $6.25 + $2.5 = $8.75
- GPT-4o-mini ($0.15/M, $0.6/M): $0.75 + $0.3 = $1.05
- Claude 3.5 Haiku ($0.8/M, $4/M): $4 + $2 = $6
Takeaway: Gemini Pro is ~60% cheaper than Claude Sonnet for the same task. But quality ranks Claude > GPT-4o > Gemini on complex reasoning. Cheaper ≠ cheaper-in-practice — debugging a wrong answer can cost more than 1000 runs.
What happens when context fills up? Three failure modes
Context window is the maximum token span the model can "see": GPT-4o = 128K, Claude 3.5 = 200K, Gemini 1.5 = 1M+.
Exceed it and you hit one of three modes:
- Hard error (OpenAI, Anthropic): API returns 400
context_length_exceeded - Head truncation (some legacy / wrapper layers): earlier conversation silently gets dropped, the model "forgets" what you said
- Quality collapse (close to the ceiling): not exceeded yet, but the model's attention to earlier content degrades — hallucination, ignored instructions
Rule of thumb: at 80% of the window, actively compress. Don't wait for the error. For long sessions, use summarize-and-discard: every 10 turns, ask the model to summarize the previous 10 turns and replace the originals.
3 practical cost-saving techniques
1. Move few-shot examples to system prompt + prompt caching Claude and OpenAI both support prompt caching — repeated system prompts hit cache, cost drops to ~1/10. Stuff all few-shot examples into the system prompt; user prompt carries only the task at hand.
2. Use JSON over natural language for structured data
"The user's name is Mark Liu, age 32, an engineer" = ~15 tokens
{"name":"Mark Liu","age":32,"job":"engineer"} = ~12 tokens
JSON looks heavier, but BPE optimizes { : " heavily — actual density is higher.
3. Use GPT-4o-mini for routing / classification, escalate hard cases Over 70% of LLM tasks are classification / extraction / formatting — GPT-4o-mini and Claude Haiku handle them just fine. Treat them as Layer 1, only escalate complex reasoning to large models. Average cost drops by ~80%.
Why this tool uses a heuristic instead of a real tokenizer
OpenAI's tiktoken, Anthropic's tokenizer, etc — each requires loading a ~1 MB vocab dictionary + ~500 KB WASM runtime. For an estimation tool, adding 1.5 MB of bundle just for ±2% precision isn't worth it.
Our heuristic:
- Chinese / CJK chars: 1.5 tokens each (averaged from 1–2)
- English words: BPE-style subword splitting, ~1.3 tokens per word
- Punctuation / numbers / emoji: 1 token each
- Code: identify common keywords, compress estimate
Typical error ±10% — enough for budget estimation and context-fit checks. For exact billing, the API response's usage.total_tokens is the source of truth.
Next time you're picking an AI model or launching a feature, run a quick check with the AI Token Counter — see your prompt cost across 5 models. The estimation runs in the browser — your prompt never leaves the machine.