The economics of AI translation are undergoing a structural shift. For over a decade, the localization industry relied on a stable set of neural machine translation (NMT) engines (Google Translate, DeepL, Amazon Translate, Microsoft Translator), each with straightforward per-character pricing and predictable cost curves.
That era is mostly over.
Large language models (LLMs) have entered the translation workflow. Not as a curiosity or research experiment, but as production-grade alternatives that an increasing number of LSPs, enterprises, and localization platforms are integrating into live workflows.
The question facing project managers, localization engineers, and translation buyers is no longer whether LLMs can translate; it’s whether the economics make sense, and under what conditions.
This article provides a rigorous, normalized cost comparison between traditional MT engines and the LLMs available via API as of early 2026.
It covers pricing architectures, establishes a common unit of comparison (cost per 1 million source words), accounts for the hidden overheads that inflate LLM costs beyond their headline token rates, and contextualizes the numbers against quality and capability.
Methodology: An apples-to-apples comparison
Before diving into the specific AI translation costs, we must address the fundamental incompatibility between how MT engines and LLMs charge for translation.
How MT engines charge
Traditional MT APIs bill per character of source text sent to the API, which is clean and deterministic. If you send 1 million characters, you know exactly what you’ll pay.
The output length is irrelevant to billing, with the notable exception of Google’s newer LLM-based translation offering, which charges for both input and output characters.
Whitespace, punctuation, and markup tags all count toward billing by default. However, some APIs exclude markup when using tag-handling modes. For example, DeepL’s tag_handling=xml parameter excludes tag characters and attributes from the billed character count.
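For teams calling MT APIs directly, the billing effect of tag handling is easy to verify. Below is a minimal sketch using DeepL's official Python client; the auth-key environment variable and sample text are illustrative.

```python
# A minimal sketch using DeepL's official Python client. With tag_handling
# enabled, DeepL preserves markup and (per its documentation) excludes tag
# characters and attributes from the billed character count.
import os

import deepl

translator = deepl.Translator(os.environ["DEEPL_AUTH_KEY"])

result = translator.translate_text(
    "<p>Release notes for <code>v2.1</code> are attached.</p>",
    source_lang="EN",
    target_lang="DE",
    tag_handling="xml",
)
print(result.text)  # markup intact; tag characters were not billed
```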
How LLMs charge
LLMs bill per token, both input tokens (what you send) and output tokens (what the model generates). A token is a subword unit, not a character or a word.
In English, one token averages roughly 4 characters or 0.75 words. Critically, the cost of an LLM translation includes not just the source text but also:
- The system prompt: Instructions telling the model to act as a translator, specifying source and target languages, tone, terminology constraints, and formatting rules. Commonly seen ranges are 200–500 tokens, though teams with aggressive prompt caching may use substantially larger system prompts.
- Few-shot examples (optional but common): Providing 2–5 example translation pairs to guide style and terminology. This commonly adds 500–2,000 tokens of input overhead per request.
- The translated output: LLMs charge separately for generated tokens, and output tokens are almost always more expensive than input tokens, often 3–8x more.
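To make this billing structure concrete, here is a minimal sketch of the arithmetic for a single request. All token counts and rates are illustrative placeholders, not any specific provider's pricing.

```python
# Illustrative cost arithmetic for one LLM translation request, using the
# overhead structure described above. All numbers are placeholders.
SYSTEM_PROMPT_TOKENS = 300       # instructions, tone, terminology rules
FEW_SHOT_TOKENS = 1_000          # 2-5 example translation pairs
SOURCE_TOKENS = 650              # the text actually being translated
OUTPUT_TOKENS = 650              # assume output roughly equals source length

INPUT_RATE = 2.50 / 1_000_000    # USD per input token (illustrative)
OUTPUT_RATE = 15.00 / 1_000_000  # USD per output token (note the 6x premium)

input_cost = (SYSTEM_PROMPT_TOKENS + FEW_SHOT_TOKENS + SOURCE_TOKENS) * INPUT_RATE
output_cost = OUTPUT_TOKENS * OUTPUT_RATE
print(f"input: ${input_cost:.5f}, output: ${output_cost:.5f}")
# Only a third of the billed input here is source text, and the output
# side still dominates the total because of the per-token premium.
```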
The normalization unit
To enable direct comparison, this article normalizes all costs to USD per 1 million source words translated, a unit familiar to localization professionals. The conversion assumptions are:
- One English word ≈ 5.5 characters (including spaces).
- One million words ≈ 5.5 million characters.
- One English token ≈ 0.75 words, so 1 million words ≈ 1.33 million tokens of source text.
- Output length is assumed equal to input length, which is a reasonable average, though output length varies significantly by language pair. Romance languages like Spanish and French typically run 15–25% longer than English, while Chinese is often 30–50% shorter in character count despite similar semantic content.
- For LLMs, a 300-token system prompt is included in each API call, but the number of calls depends heavily on chunking strategy. With long-context models (128K+ windows), translation workloads are typically batched into large chunks of around 30,000 words per request, yielding roughly 33 calls per million words and around 10K tokens of prompt overhead, which is negligible relative to the source text.
With short-context models or segment-level workflows, such as 500 words per request, the same prompt repeats around 2,000 times, adding around 600K tokens of overhead. The comparison tables below use the long-context assumption of roughly 10K prompt tokens, which reflects current best practice for most LLMs listed. Practitioners using segment-level chunking should add 1–3% to input token costs to account for the additional prompt overhead.
These assumptions are deliberately conservative for LLMs and use an English-centric lens, as English remains the most common source language in commercial translation.
Practitioners should note that token and word economics differ materially for CJK languages, where a single character can carry far more semantic weight than an English word but may consume multiple tokens depending on the tokenizer, and morphologically rich languages like Finnish, Turkish, or Hungarian, where long compound words inflate character counts without proportionally increasing semantic content.
A full treatment of per-language cost modeling is beyond this article’s scope, but the framework below can be adapted with language-specific multipliers.
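As a worked example of these assumptions, the sketch below computes the prompt overhead for the two chunking strategies just discussed; the constants mirror the article's working figures.

```python
# A sketch of the chunking assumptions above: how per-request system
# prompts accumulate into billed input overhead.
SYSTEM_PROMPT_TOKENS = 300

def prompt_overhead(total_words: int, words_per_request: int) -> int:
    """Total system-prompt tokens billed across all requests."""
    requests = -(-total_words // words_per_request)  # ceiling division
    return requests * SYSTEM_PROMPT_TOKENS

# Long-context chunking: ~33 requests per million words.
print(prompt_overhead(1_000_000, 30_000))  # 10200 (~10K tokens, negligible)

# Segment-level chunking: ~2,000 requests per million words.
print(prompt_overhead(1_000_000, 500))     # 600000 (~600K tokens)
```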
AI translation costs: How much do MT engines cost?
The MT engine market in 2026 has a remarkably stable pricing structure, with most providers clustered in a narrow band. Here is the current state of play.
Google Cloud Translation
Google offers multiple translation tiers through its Cloud Translation API:
- NMT (Basic and Advanced): $20 per million characters. This is the standard neural MT that has been the industry default for years. Advanced adds glossary support, batch translation, and custom model capabilities at the same base price.
- Translation LLM: $10 per million input characters + $10 per million output characters. This is Google’s LLM-powered translation endpoint, which effectively costs $20 per million characters (assuming roughly equal input/output lengths), making it cost-equivalent to standard NMT by design.
- Adaptive Translation: $25 per million input characters + $25 per million output characters. This premium tier uses LLMs with user-provided examples to continuously improve output quality.
- Custom (AutoML) models: $30–80 per million characters depending on volume (tiered: $80 up to 250M chars, $60 for 250M–2.5B, $40 for 2.5B–4B, $30 above 4B), plus $45/hour training costs (capped at $300 per job).
- Free tier: 500,000 characters/month, no expiration.
Normalized cost per 1M source words (NMT): $110.00
DeepL
DeepL’s API operates on a base-fee-plus-usage model:
- API Free: 500,000 characters/month, limited features.
- API Pro: Monthly base fee (starting around $5.49/month, varies by billing currency and region) + ~$25 per million characters (USD pricing; exact rate may vary). No hard character cap.
- Language coverage: Over 100 languages as of late 2025, following a major expansion that added approximately 70 new languages. DeepL’s historical strength remains in European language pairs, but its coverage now rivals the breadth of traditional MT engines.
Normalized cost per 1M source words: ~$137.50 (excluding the base fee, which becomes negligible at volume)
Amazon Translate
Amazon’s offering is the most straightforward:
- Standard pricing: $15 per million characters.
- Active Custom Translation: $60 per million characters (with parallel data customization).
- Free tier: 2 million characters/month for 12 months after signup.
Normalized cost per 1M source words: $82.50
Microsoft Translator (Azure)
Microsoft offers competitive pricing with generous free allowances:
- Free tier (F0): 2 million characters/month — the most generous free tier among major providers.
- Standard (S1): $10 per million characters.
- Custom Translation: $40 per million characters + model hosting fees + training charges (capped at $300 per training run).
- Volume discounts: Available through commitment tiers for high-volume standard translation.
Normalized cost per 1M source words: $55.00
ModernMT
ModernMT, widely used within the localization industry and integrated into many TMS platforms, uses a subscription-plus-usage model rather than pure per-character pricing.
Published tiers include $15 per million characters for individual/small team plans and up to $100 per million characters for enterprise “Localization Teams” plans, plus monthly subscription fees. Effective per-character costs vary significantly by plan and volume commitment.
Normalized cost per 1M source words: ~$82–550 (varies widely by plan tier)
| MT Engine | Per-Million-Character Rate | Cost per 1M Source Words | Free Tier |
|---|---|---|---|
| Microsoft Translator | $10 | $55.00 | 2M chars/month |
| Amazon Translate | $15 | $82.50 | 2M chars/month (12 months) |
| Google Cloud Translation (NMT) | $20 | $110.00 | 500K chars/month |
| DeepL API Pro | $25 | $137.50 | 500K chars/month |
| Google Adaptive Translation | $25 + $25 (in/out) | $275.00 | None |
The range for standard NMT spans roughly $55–$138 per million source words. This is the baseline against which LLM costs should be measured.
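For reference, every normalized figure in the table above follows from a single conversion, sketched below using the article's assumption of 5.5 characters per English word.

```python
# The char-to-word normalization used throughout this section, assuming
# one English word ≈ 5.5 characters (spaces included).
CHARS_PER_WORD = 5.5

def mt_cost_per_million_words(rate_per_million_chars: float) -> float:
    return rate_per_million_chars * CHARS_PER_WORD

for engine, rate in [("Microsoft", 10), ("Amazon", 15), ("Google NMT", 20), ("DeepL", 25)]:
    print(f"{engine}: ${mt_cost_per_million_words(rate):.2f} per 1M source words")
# Microsoft: $55.00, Amazon: $82.50, Google NMT: $110.00, DeepL: $137.50
```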
AI translation costs: How much do LLMs cost?
Before we can address the costs of specific AI translation models (usually LLMs), we need to clarify why we picked the models we did.
As of today (April 2026), the market is extraordinarily crowded. Users can choose from OpenAI’s GPT family, Anthropic’s Claude, Google’s Gemini, xAI’s Grok, Mistral’s models, and Meta’s Llama. That’s just from Western providers.
In China, the roster compounds with DeepSeek, Alibaba’s Qwen, Moonshot AI’s Kimi, Zhipu’s GLM, ByteDance’s Doubao, and others. A comprehensive pricing guide for every model would be impractical. To keep this analysis useful rather than exhaustive, we apply three filters.
- API availability for production use: The model must be accessible via a commercial API with published pricing, not just available as open weights on Hugging Face.
- Demonstrated translation quality: The model must have either published multilingual benchmarks, significant adoption in translation workflows, or both. This excludes models that are primarily optimized for coding or reasoning with limited multilingual evidence.
- Representation across price tiers: We include models from each major cost bracket (frontier, balanced, and cost-optimized) to give localization professionals a clear picture of the tradeoffs at each level.
This means we will include the latest models when possible (e.g., Claude 4.6 rather than older series). For Chinese LLMs, rather than listing every variant from a single provider (DeepSeek alone offers V3.1, V3.2, R1, and more), we select one representative model from each of the four leading companies.
Notable omissions include xAI’s Grok (strong general model but limited published translation benchmarks and less competitive on price), Mistral (excellent for European languages but narrower multilingual coverage than the models we include), and Meta’s Llama (primarily relevant for self-hosting, covered in that section).
These are not quality judgments; they reflect the scope constraints of a cost-focused analysis.
Note: The problem with reasoning models
The distinction between “reasoning” and “non-reasoning” models has become blurred to the point of near-meaninglessness, and this has direct cost implications for translation.
In 2024, the line was clear.
OpenAI’s o1 was a reasoning model; GPT-4o was not. You picked one or the other.
In 2026, nearly every frontier model ships with some form of optional or adaptive reasoning. GPT-5 defaults to medium reasoning effort, with minimal as the lowest available setting. Notably, developers on OpenAI’s forums have reported that default reasoning sometimes echoes the source text instead of translating it because the reasoning layer interferes.
GPT-5.2 defaults to reasoning_effort=none and can be dialed up through low, medium, high, and xhigh. Claude Opus 4.6 uses “adaptive thinking,” where the model itself decides how much to reason based on task complexity. At the default high effort level, it almost always thinks. At low, it may skip thinking entirely.
Kimi K2.5 offers explicit “Thinking” and “Instant” modes. Even Gemini 3 Flash uses “dynamic thinking” by default.
For translation, this matters because thinking tokens are billed as output tokens, the most expensive category. A model that decides a translation requires deep reasoning will burn through output tokens on internal deliberation that adds no value to the final translation.
Practically, when using any model for translation via API, it is important to set the reasoning effort to the lowest available setting: none, low, disabled, or instant, depending on the provider.
This is not just a cost optimization.
Multiple developers have reported that reasoning actually degrades translation quality by overcomplicating what is fundamentally a pattern-matching and generation task.
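In practice, disabling reasoning is a one-line change in most SDKs. The sketch below uses the OpenAI Python SDK and the reasoning_effort parameter discussed above; the model identifier and the accepted values are illustrative, and other providers (e.g., Anthropic's "thinking" configuration) expose the control under different names.

```python
# A hedged sketch of disabling reasoning for a translation call via the
# OpenAI Python SDK. The model name and the exact values accepted by
# reasoning_effort are illustrative; check your provider's current docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5.2",          # illustrative model identifier
    reasoning_effort="none",  # lowest setting: no thinking tokens billed
    messages=[
        {"role": "system",
         "content": "Translate the user's text from English to German. "
                    "Output only the translation."},
        {"role": "user", "content": "The invoice is due within 30 days."},
    ],
)
print(response.choices[0].message.content)
```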
LLM pricing is more complex, more variable, and, depending on which model you choose, can range from cheaper than MT engines to orders of magnitude more expensive.
The critical factors are model selection, prompt engineering efficiency, and whether you leverage cost-optimization features like caching and batching.
Tier 1: Frontier models
These are the most capable models available, typically offering the highest translation quality at the highest price.
OpenAI GPT-5.4 (released March 2026, current flagship)
- Standard tier: Input $2.50/M tokens | Output $15.00/M tokens
- Cached input: $0.25/M tokens (90% discount)
- 1M+ token context window (922K input, 128K output). Pricing increases for contexts exceeding 272K tokens.
- Reasoning defaults to none; can be increased via the reasoning_effort parameter. OpenAI also offers Batch (50% off, 24-hour SLA), Flex (variable latency, lower cost), and Priority (faster processing, premium) tiers.
Anthropic Claude Opus 4.6 (released February 2026)
- Input: $5.00/M tokens
- Output: $25.00/M tokens
- Adaptive thinking is enabled by default; set effort to low for translation to minimize thinking token overhead.
- 1M token context window at standard pricing (no long-context surcharge).
Google Gemini 3.1 Pro (released February 19, 2026, currently in preview)
- Input: $2.00/M tokens
- Output: $12.00/M tokens (≤200K context)
- Long-context pricing (>200K tokens): $4.00/$18.00 per M tokens.
- Replaced Gemini 3 Pro at identical pricing with substantially improved benchmarks (77.1% ARC-AGI-2 vs. 31.1%, 94.3% GPQA Diamond). Preview pricing; GA pricing may differ.
Tier 2: Balanced performance models
These models offer near-frontier quality at significantly lower cost, which is often the sweet spot for production translation.
OpenAI GPT-4.1 (released April 2025)
- Input: $2.00/M tokens
- Output: $8.00/M tokens
- OpenAI’s best non-reasoning model. Excellent for translation where deep reasoning is unnecessary.
- 1M token context window available.
Anthropic Claude Sonnet 4.6 (released February 2026)
- Input: $3.00/M tokens
- Output: $15.00/M tokens
- Same pricing as its predecessor Sonnet 4.5, with improved capabilities. The current recommended mid-tier Claude model.
- 1M token context window at standard pricing.
Google Gemini 3 Flash (released December 17, 2025)
- Input: $0.50/M tokens
- Output: $3.00/M tokens
- Pro-grade reasoning at Flash-level speed and cost. Outperforms the previous-generation Gemini 2.5 Pro while being 3x faster and dramatically cheaper.
- Dynamic thinking by default; uses ~30% fewer tokens than 2.5 Pro on typical tasks.
- The single most compelling price/performance option in the current market for many translation use cases.
Kimi K2.5 (Moonshot AI, released February 2026)
- Input: $0.60/M tokens
- Output: $2.50–$3.00/M tokens
- Cache hit input: $0.10–$0.15/M tokens (75–83% discount)
- 1 trillion total parameters, ~32B active (MoE).
- Explicit “Thinking” and “Instant” modes — use Instant for translation to avoid reasoning token overhead.
- At ~$4.79 per million source words (Instant mode), it sits alongside GPT-4.1 Mini and Gemini 3 Flash in the mid tier.
GLM-5 (Zhipu AI / Z.AI, released February 11, 2026, MIT license)
- Input: ~$0.80/M tokens
- Output: ~$2.56/M tokens
- 745B total parameters, ~44B active (MoE). Open-source weights on Hugging Face.
- At ~$5.59 per million source words, it lands in mid-tier pricing with quality that early benchmarks suggest approaches frontier models on general tasks. Independent translation benchmarks are still emerging.
Qwen3-Max (Alibaba, launched September 2025)
- International/Global deployment: Input $1.20/M tokens | Output $6.00/M tokens (≤32K context tier; prices increase for longer contexts)
- Mainland China deployment is significantly cheaper ($0.36/$1.43 at the lowest tier), a ~4x price gap. This article uses the international rate, since most non-Chinese localization teams will access it through the International (Singapore) or Global (US Virginia) deployment mode.
- Batch processing: 50% off, available on Alibaba Cloud.
- 119 languages supported, which is the broadest language coverage among Chinese LLMs, and competitive with many MT engines.
- Unlike the other three models in this section, Qwen3-Max weights are not publicly released. It is API-only (closed-source). The smaller Qwen3 variants (up to 235B) are open-weight under Apache 2.0, but Qwen3-Max represents a strategic shift to closed distribution for Alibaba’s flagship.
- At international standard rates, ~$9.58 per million source words, or ~$4.79 with batch discounts; teams with Mainland China deployment access pay roughly $2.38 at standard rates.
Tier 3: Cost-optimized models
These smaller, faster models are dramatically cheaper and, for many language pairs, deliver translation quality that is surprisingly close to their larger siblings.
OpenAI GPT-5 Mini
- Input: $0.25/M tokens
- Output: $2.00/M tokens
- Excellent quality-to-cost ratio for translation tasks.
OpenAI GPT-5 Nano
- Input: $0.05/M tokens
- Output: $0.40/M tokens
- The cheapest OpenAI option. Quality is lower but serviceable for gisting and high-volume, low-stakes content.
OpenAI GPT-5.4 Mini (released March 17, 2026)
- Input: $0.75/M tokens
- Output: $4.50/M tokens
- Significantly more capable than GPT-5 Mini (approaches GPT-5.4 on tool-use benchmarks while running 2x faster). A meaningful quality upgrade for translation at a moderate cost increase.
OpenAI GPT-5.4 Nano (released March 17, 2026)
- Input: $0.20/M tokens
- Output: $1.25/M tokens
- A substantial quality upgrade over GPT-5 Nano (outperforms the older GPT-5 Mini on several benchmarks), but at 4x the input and 3x the output price of GPT-5 Nano. For translation quality-to-cost, GPT-5.4 Nano fills the gap between GPT-5 Nano and GPT-5 Mini.
OpenAI GPT-4.1 Mini / Nano
- Mini: $0.40/M input, $1.60/M output.
- Nano: $0.10/M input, $0.40/M output.
- Fast, affordable, and well-suited to translation routing. GPT-4.1 Mini is a particularly strong option for translation because it was designed as a non-reasoning model from the ground up.
Anthropic Claude Haiku 4.5 (released late 2025)
- Input: $1.00/M tokens
- Output: $5.00/M tokens
- Strong quality at a fraction of Sonnet’s cost; translation quality matches earlier Sonnet models.
Google Gemini 3.1 Flash-Lite (released March 3, 2026)
- Input: $0.25/M tokens
- Output: $1.50/M tokens
- The successor to Gemini 2.5 Flash-Lite, with meaningfully improved quality (outperforms 2.5 Flash on several benchmarks while being 2.5x faster to first token). Supports thinking levels for fine-grained cost/performance control. Priced at half the cost of Gemini 3 Flash, making it a strong contender for high-volume translation.
Google Gemini 2.5 Flash-Lite
- Input: $0.10/M tokens
- Output: $0.40/M tokens
- Still the absolute cheapest LLM translation option from a Western provider, but scheduled for deprecation on June 1, 2026. Teams currently using it should plan migration to Gemini 3.1 Flash-Lite or another cost-optimized model.
DeepSeek V3.2
- Input: $0.28/M tokens (cache miss)
- Output: $0.42/M tokens
- Cache hit input: $0.028/M tokens (90% discount for repeated prompts)
- 671B total parameters, ~37B active (MoE). Strong multilingual performance, particularly for CJK languages.
- At $0.93 per million source words, it competes directly with GPT-5 Nano on cost while delivering meaningfully higher quality.
DeepSeek V4 (released March 2026, MIT license)
- Input: $0.30/M tokens (cache miss)
- Output: $0.50/M tokens
- Cache hit input: $0.03/M tokens (90% discount)
- The successor to V3.2, with meaningful quality improvements (81% SWE-bench Verified vs. V3.2’s 69%). V3.2 remains available at its original pricing for budget-conscious workloads.
- At $1.07 per million source words, it remains dramatically cheaper than any MT engine.
Note on hosting: Some of the lowest-cost models in this analysis are offered by Chinese providers or as open-weight models served through third-party inference platforms. For translation buyers, the key question is not just which model is being used, but where requests are processed.
A provider’s native API may route data to that provider’s own infrastructure, while the same model served through a third-party host may keep processing in the US or EU. Teams handling sensitive or regulated content should verify hosting location, retention terms, and contractual controls before deployment.
Note: How the fast pace of the AI industry affects AI translation costs
The pace of model releases across the entire AI industry has been extraordinary in late 2025 and early 2026. In a single two-week window in February 2026, we saw these releases:
- Google's Gemini 3.1 Pro.
- Anthropic's Claude Opus 4.6 and Sonnet 4.6.
- Zhipu's GLM-5.
- Moonshot's Kimi K2.5.
- ByteDance's Doubao 2.0.
- Alibaba's Qwen 3.5.
And the prior months were no slouches either. Google launched Gemini 3 Pro in November and Gemini 3 Flash in December. OpenAI shipped GPT-5.2 in December, GPT-5.3 Codex for coding, GPT-5.4 in March 2026, and GPT-5.4 Mini and Nano just days later on March 17.
DeepSeek followed V3.2 with V4 in early March. Google released Gemini 3.1 Flash-Lite on March 3, extending its 3.1-series down to the cost-optimized tier.
The cadence shows no sign of slowing.
At the time of writing (April 2026), a new crop of models is in the works, each expected to outperform the current SOTA. Each of these has the potential to shift the competitive landscape, particularly on pricing, where each new generation has historically come in 30–60% cheaper than its predecessor.
As such, any pricing snapshot in this environment carries an expiration date. The framework and methodology in this article (normalizing to cost per million source words, accounting for prompt overhead, comparing across tiers) are designed to remain useful even as the specific numbers change.
But the specific model recommendations and price points should be verified against current rates before making procurement decisions.
How to compare AI translation costs: LLMs vs. MT
For the normalized LLM comparison, this article assumes that 1 million source words corresponds to approximately 1.33 million input tokens of source text, based on the working conversion of 1 token ≈ 0.75 words.
To reflect real translation workflows, the calculation also includes a small amount of prompt overhead: a 300-token system prompt repeated across roughly 33 long-context requests of about 30,000 words each, which adds approximately 10,000 tokens in total. On that basis, the total billed input volume is modeled as:
Input tokens = 1.33M source-text tokens + 10K prompt tokens ≈ 1.34M tokens
Output is assumed to be roughly equal in length to the source on average, yielding:
Output tokens ≈ 1.33M tokens
All quoted prices use standard real-time API rates taken from each provider’s official pricing page, with no batch discounts, no prompt caching, and no off-peak or flex pricing applied unless explicitly stated elsewhere.
For provider-specific assumptions, Qwen3-Max is calculated using international deployment pricing at $1.20 per million input tokens and $6.00 per million output tokens; mainland China deployment is materially cheaper and would reduce the normalized cost accordingly.
DeepSeek is calculated using cache-miss input pricing at $0.28 per million input tokens; workflows that achieve cache hits on repeated prompt material would lower the effective input-side cost substantially.
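To make the tables below reproducible, the sketch that follows implements this formula and checks it against a few of the listed models; the rates come from the model sections above.

```python
# Reproducing the normalized table values from the stated assumptions:
# ~1.34M billed input tokens (1.33M source + ~0.01M prompt overhead) and
# ~1.33M output tokens per million source words.
INPUT_TOKENS_M = 1.34
OUTPUT_TOKENS_M = 1.33

def cost_per_million_words(input_rate: float, output_rate: float) -> float:
    """Rates are USD per million tokens; result is USD per 1M source words."""
    return INPUT_TOKENS_M * input_rate + OUTPUT_TOKENS_M * output_rate

print(cost_per_million_words(2.00, 8.00))   # GPT-4.1          -> 13.32
print(cost_per_million_words(0.50, 3.00))   # Gemini 3 Flash   -> 4.66
print(cost_per_million_words(5.00, 25.00))  # Claude Opus 4.6  -> 39.95
```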
Standard rates (no discounts)
| Provider / Model | Tier | Cost per 1M Source Words | vs. Cheapest MT |
|---|---|---|---|
| Microsoft Translator | MT | $55 | baseline |
| Amazon Translate | MT | $83 | 1.5x |
| Google Cloud NMT | MT | $110 | 2.0x |
| DeepL API Pro | MT | $138 | 2.5x |
| GPT-5 Nano | Cost-optimized LLM | $0.59 | 0.011x |
| GPT-4.1 Nano | Cost-optimized LLM | $0.66 | 0.012x |
| Gemini 2.5 Flash-Lite | Cost-optimized LLM | $0.66 | 0.012x |
| DeepSeek V3.2 | Cost-optimized LLM | $0.93 | 0.017x |
| DeepSeek V4 | Cost-optimized LLM | $1.07 | 0.019x |
| GPT-4.1 Mini | Balanced LLM | $2.66 | 0.048x |
| GPT-5 Mini | Balanced LLM | $2.99 | 0.054x |
| Gemini 3 Flash | Balanced LLM | $4.66 | 0.085x |
| Kimi K2.5 (Instant) | Balanced LLM | $4.79 | 0.087x |
| GLM-5 | Balanced LLM | $5.59 | 0.10x |
| Claude Haiku 4.5 | Balanced LLM | $7.99 | 0.15x |
| Qwen3-Max (Intl) | Balanced LLM | $9.58 | 0.17x |
| GPT-4.1 | Balanced LLM | $13.32 | 0.24x |
| Claude Sonnet 4.6 | Balanced LLM | $23.97 | 0.44x |
| Gemini 3.1 Pro (preview) | Frontier LLM | $18.64 | 0.34x |
| GPT-5.2 (Standard tier) | Frontier LLM | $20.96 | 0.38x |
| GPT-5.4 (Standard tier) | Frontier LLM | $23.30 | 0.42x |
| Claude Opus 4.6 | Frontier LLM | $39.95 | 0.73x |
With batch API discounts (50% off, available from most providers)
| Provider / Model | Standard Cost | Batch Cost | vs. Cheapest MT |
|---|---|---|---|
| GPT-5 Nano | $0.59 | $0.30 | 0.005x |
| DeepSeek V3.2 | $0.93 | $0.47 | 0.009x |
| DeepSeek V4 | $1.07 | $0.53 | 0.010x |
| GPT-5 Mini | $2.99 | $1.50 | 0.027x |
| Gemini 3 Flash | $4.66 | $2.33 | 0.042x |
| Qwen3-Max (Intl) | $9.58 | $4.79 | 0.087x |
| GPT-4.1 | $13.32 | $6.66 | 0.12x |
| GPT-5.2 | $20.96 | $10.48 | 0.19x |
| GPT-5.4 | $23.30 | $11.65 | 0.21x |
| Claude Sonnet 4.6 | $23.97 | $11.99 | 0.22x |
| Claude Opus 4.6 | $39.95 | $19.98 | 0.36x |
With prompt caching (90% discount on repeated input)
Prompt caching benefits are proportional to how much of your input is reusable. If your system prompt and few-shot examples constitute 30% of your input tokens, caching reduces effective input cost by ~27%.
For workflows where the same glossary and instructions are sent with every request (the typical translation pipeline scenario), the savings are significant (especially for models with high input token prices).
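The arithmetic behind these savings, and the output-token floor discussed in the next section, takes only a few lines; the sketch below assumes the 90% cache discount quoted earlier.

```python
# Caching arithmetic: a 90% discount on the cached share of input reduces
# effective input cost by (cached_share * 0.9).
def effective_input_multiplier(cached_share: float, discount: float = 0.9) -> float:
    return 1 - cached_share * discount

# 30% of input reusable -> input costs ~27% less overall.
print(effective_input_multiplier(0.30))  # 0.73

# Caching never touches output tokens, which is why output pricing sets
# the floor: DeepSeek V3.2 output alone is 1.33M tokens x $0.42/M.
print(1.33 * 0.42)  # ~0.56 USD per million source words, irreducible
```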
How LLMs became cheaper than MT engines
The numbers so far reveal a striking fact that would have been unthinkable two years ago: the majority of LLMs are now cheaper than traditional MT engines for raw translation throughput.
This is not a marginal difference. At standard rates, GPT-5 Nano costs about $0.60 per million source words.
That’s roughly 186x cheaper than Google Cloud Translation NMT and 230x cheaper than DeepL. Even higher-quality balanced models like GPT-4.1 at $13.32 per million words still undercut every MT engine in the comparison.
The cost advantage becomes even more extreme when batch processing and caching are applied. A DeepSeek V3.2 translation pipeline with aggressive caching (where most input is cache-hit) can process a million words for roughly $0.59, though output tokens ($0.56/M words alone) form an irreducible floor that caching cannot reduce, since caching only discounts repeated input. Combining caching with batch or off-peak discounts could push the total under $0.50.
How did this happen?
MT engines were designed and priced in an era when they were the only automated translation option. Their pricing reflects not just compute costs but also years of R&D amortization, data licensing, enterprise sales infrastructure, and the premium of being a purpose-built solution.
They have not faced meaningful price pressure because competition was limited to other MT engines operating at similar cost structures.
LLMs, by contrast, are priced in a hyper-competitive market where providers are racing for developer adoption and market share. Translation is not the primary revenue driver for any LLM provider; it’s a use case that rides on infrastructure built for coding, reasoning, and general-purpose chat.
The marginal cost of serving a translation request is a tiny fraction of the model’s full capability cost, and aggressive pricing reflects a land-grab strategy rather than sustainable unit economics.
This raises a legitimate question about pricing durability: will LLM pricing remain this low? The trend so far has been relentlessly downward. OpenAI has cut prices on successive model generations by 50–80%. DeepSeek cut its API pricing by over 50% in September 2025. Google’s Gemini Flash models have gotten cheaper with each update.
The structural forces driving this (competition, architectural efficiency gains such as MoE, and growing inference infrastructure) show no signs of reversing. If anything, the entry of more Chinese providers and the proliferation of open-source models suggest that the floor has not yet been reached.
This inversion has profound implications for the localization industry, but it comes with important caveats that the raw numbers don’t capture.
Hidden AI translation costs
Cost per million words is a necessary but insufficient metric. Several factors can significantly alter the effective cost and value proposition of each option.
1. Engineering overhead
MT engines are purpose-built for translation. You send text in, you get a translation out. The API contract is simple, the output is predictable, and integration into TMS platforms is mature and standardized (most major TMS systems have native connectors for Google, DeepL, Microsoft, and Amazon).
LLMs require substantially more engineering investment. You need to design and maintain system prompts, handle chunking logic for long documents (most LLMs have output limits that require splitting source text into manageable segments), build retry logic for rate limits and timeouts, parse output that may include unwanted preamble or commentary, manage glossary injection, and handle format preservation (which MT engines do natively for formats like HTML and XLIFF but LLMs do not).
This engineering overhead is a real cost that doesn’t appear in the per-token price. For an LSP processing millions of words per month across dozens of language pairs, the development and maintenance cost of an LLM translation pipeline can easily run $50,000–$200,000 annually in engineering time, a cost that, when amortized across volume, may narrow or even eliminate the per-word cost advantage.
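To make that overhead concrete, the sketch below shows two of the smaller pieces of such a pipeline: word-based chunking and retry with backoff. The helper names and the translate_fn callable are hypothetical placeholders, not a real library's API.

```python
# A minimal sketch of two pipeline pieces described above. translate_fn
# stands in for whatever provider client you use; these names are
# hypothetical illustrations.
import time
from typing import Callable

def chunk_by_words(text: str, max_words: int = 500) -> list[str]:
    """Split source text into chunks that fit the model's output limits."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def translate_with_retry(chunk: str, translate_fn: Callable[[str], str],
                         retries: int = 3) -> str:
    """Retry transient failures (rate limits, timeouts) with backoff."""
    for attempt in range(retries):
        try:
            return translate_fn(chunk)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    raise RuntimeError("unreachable")
```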
2. Output consistency and predictability
MT engines are largely deterministic. The same input produces the same output every time (assuming no model updates), and they can be run with fixed seeds for reproducibility.
LLMs are more overtly stochastic. The same input can produce different outputs across runs, even with temperature set to 0 (due to batching and floating-point non-determinism).
For localization workflows that depend on consistency (such as repeated segments, version updates, or translation memory matching), this variability introduces friction and potential rework costs.
That said, NMT systems are neural models too and can also hallucinate, producing fluent translations completely decoupled from the source text (a phenomenon studied extensively in NMT research since 2018).
The difference is one of degree and frequency: NMT hallucinations tend to be triggered by specific edge cases (out-of-domain input, rare vocabulary, noisy source text), while LLM hallucinations can occur more broadly and may be harder to detect because they’re often fluent and plausible.
3. Format handling
MT engines natively handle structured formats like HTML, XLIFF, and XML, preserving tags and formatting without special configuration. LLMs, by contrast, can damage, drop, or hallucinate tags.
The industry has developed workarounds (tag-replacement strategies, post-processing validators), but these add complexity and occasional failure rates that inflate effective cost.
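One common workaround is tag replacement, sketched below: markup is swapped for opaque placeholders before the LLM call and restored afterward, with a check that no placeholder was dropped. This is an illustrative sketch, not a production-hardened implementation.

```python
# A sketch of the tag-replacement workaround: protect markup from the LLM
# with placeholders, then restore and validate after translation.
import re

TAG_RE = re.compile(r"<[^>]+>")

def protect_tags(text: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}
    def _swap(match: re.Match) -> str:
        key = f"⟦{len(mapping)}⟧"
        mapping[key] = match.group(0)
        return key
    return TAG_RE.sub(_swap, text), mapping

def restore_tags(translated: str, mapping: dict[str, str]) -> str:
    for key, tag in mapping.items():
        if key not in translated:
            raise ValueError(f"placeholder {key} lost in translation")
        translated = translated.replace(key, tag)
    return translated

protected, tags = protect_tags("<b>Save</b> your <i>draft</i>.")
# protected == "⟦0⟧Save⟦1⟧ your ⟦2⟧draft⟦3⟧."
```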
4. Latency and throughput
For real-time or near-real-time translation (live chat, customer support, UI localization), latency matters. MT engines typically respond in 50–200ms for a sentence.
LLMs vary widely: cost-optimized models may respond in 200–500ms, but frontier models can take 2–10 seconds per request, especially with reasoning enabled. For high-throughput batch workflows, this latency translates directly to wall-clock time and, potentially, to compute costs if you’re paying for orchestration infrastructure.
5. Language pair coverage
MT engines cover 75–130+ languages, and DeepL now exceeds 100 following its late-2025 expansion. LLMs vary: frontier models perform well on 20–40 high-resource languages but degrade significantly on low-resource pairs.
For LSPs serving clients with diverse language needs, MT engines still provide the most consistent coverage across long-tail language pairs.
How translation quality maps to cost tiers
A cost analysis without a quality dimension is incomplete, but quality in machine translation is notoriously difficult to measure. The key points for localization professionals:
The benchmarking landscape
Traditional metrics like BLEU (which measures n-gram overlap with reference translations) are increasingly recognized as inadequate, particularly for evaluating LLM output.
LLMs often produce translations that are stylistically different from reference translations (more fluent, more idiomatic, sometimes more creative), which BLEU penalizes despite being preferred by human evaluators.
A translation that says “The company posted strong results” rather than the reference’s “The company reported positive outcomes” would score poorly on BLEU despite being a perfectly valid, arguably better, translation.
COMET (Crosslingual Optimized Metric for Evaluation of Translation), which uses neural encoders trained on human judgments, correlates better with human preferences.
However, as discussed extensively at WMT 2024, COMET also has significant blind spots for LLM evaluation. It can assign generous scores to fluent but unfaithful translations (“hallucinations”), and it struggles to capture the quality dimensions that matter most in production localization contexts—terminology accuracy, register appropriateness, domain fidelity, and brand voice consistency.
Researchers at WMT 2024 described this divergence as an “evaluation crisis,” noting that COMET’s training data rarely includes the kinds of confident-but-wrong outputs that LLMs occasionally produce.
The emerging consensus is that no single automated metric is sufficient. The most reliable approach combines COMET scoring for rapid screening with structured human evaluation using MQM (Multidimensional Quality Metrics) error typologies for production quality assurance.
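For teams that want to operationalize the COMET-screening step, the open-source unbabel-comet package provides scoring out of the box. The sketch below assumes that package and one of its publicly released WMT checkpoints; verify names against the package's current documentation.

```python
# A sketch of COMET-based screening with unbabel-comet
# (pip install unbabel-comet). The checkpoint name is one of the public
# WMT releases; scores for it land roughly on a 0-1 scale, higher is better.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Das Unternehmen meldete starke Ergebnisse.",
    "mt": "The company posted strong results.",
    "ref": "The company reported positive outcomes.",
}]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.system_score)
```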
Some organizations are also experimenting with “LLM-as-a-judge” approaches (using a frontier model to evaluate translations), which shows promise for scalability but introduces its own biases (models tend to prefer outputs that resemble their own style).
For the localization industry specifically, Intento’s annual State of Translation Automation report has become an important benchmark source, as it evaluates MT and LLM quality across a wide range of language pairs using production-relevant content types rather than academic test sets.
Quality tiers in practice
Based on available benchmarks, industry evaluations (including Intento’s State of Translation Automation 2025 and WMT 2024 results), and practitioner reports, the current quality landscape can be roughly tiered:
Tier 1 – Frontier quality (near-human for high-resource pairs)
GPT-5.4, GPT-5.2, GPT-4.1, Claude Sonnet 4.6 / Opus 4.6, Gemini 3.1 Pro. These models match or exceed the best NMT engines on major language pairs and often produce more natural, fluent output. They handle context, idiom, and register better than any MT engine.
Tier 2 – Production quality (comparable to best NMT)
GPT-5 Mini, Claude Haiku 4.5, Gemini 3 Flash, DeepSeek V4, DeepSeek V3.2, Kimi K2.5 (Instant mode), GLM-5, Qwen3-Max. These models deliver translation quality broadly equivalent to Google NMT or DeepL for high-resource pairs, sometimes better, sometimes slightly worse, depending on the language pair and domain.
Tier 3 – Serviceable quality (adequate for gisting, internal use)
GPT-5 Nano, GPT-4.1 Nano, Gemini 2.5 Flash-Lite. These are fast and extremely cheap but show noticeable quality drops on complex sentences, low-resource languages, and domain-specific content.
The critical insight is this: Tier 2 LLMs now deliver MT-equivalent quality at 10–100x lower cost than the MT engines they match. This is the core of the disruption.
