Abstract
Tokenization—the process of segmenting text into discrete units—is foundational to modern Natural Language Processing (NLP) and multilingual large language models (LLMs). However, when applied across diverse languages and scripts, tokenization mechanisms can exhibit significant mismatches that degrade model performance, fairness, and resource efficiency. This article provides a comprehensive survey of tokenization mismatches across languages, examining their causes, manifestations, impacts on downstream tasks, and potential remedies. We review subword, byte-level, and character-level approaches; quantify fragmentation disparities; analyze economic and societal implications; and outline research gaps and future directions toward equitable, script‐aware tokenization.


1. Introduction

Tokenization underpins the transformation of raw text into model inputs, influencing vocabulary size, input length, and representational fidelity. While subword methods (e.g., Byte-Pair Encoding, WordPiece, and the BPE and unigram models implemented in SentencePiece) have succeeded in high-resource languages, multilingual models exhibit pronounced vocabulary imbalance: high-resource or Latin-script languages dominate the token space, while low-resource or non-Latin scripts suffer from over-fragmentation and semantic degradation. These mismatches exacerbate model bias, inflate inference costs, and hinder performance on underrepresented languages [jetir+1].

2. Tokenization Mechanisms

2.1 Word-Level, Character-Level, and Subword Tokenizers

  • Word-level tokenization fails on languages without explicit whitespace delimiters (e.g., Chinese, Thai) and yields massive vocabularies prone to out-of-vocabulary (OOV) issues.
  • Character-level tokenization guarantees coverage but loses morphological structure and lengthens sequences, straining model capacity.
  • Subword tokenization strikes a balance, segmenting into statistically frequent units. Yet shared subword vocabularies across languages lead to coverage disparities, as scripts and morphological patterns diverge.
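The first two failure modes in the list above can be seen with a minimal sketch that uses only the Python standard library. The sample sentences and the helper names `word_tokens` and `char_tokens` are illustrative assumptions, not part of any library.

```python
def word_tokens(text: str) -> list[str]:
    """Word-level tokenization: split on whitespace only."""
    return text.split()


def char_tokens(text: str) -> list[str]:
    """Character-level tokenization: one token per non-space code point."""
    return [ch for ch in text if not ch.isspace()]


# Illustrative samples (assumed for this sketch).
samples = {
    "English": "the cat sat on the mat",
    "Chinese": "猫坐在垫子上",
}

for name, text in samples.items():
    # Whitespace splitting treats the whole Chinese sentence as one "word",
    # while character-level splitting makes the English sequence much longer.
    print(name, len(word_tokens(text)), len(char_tokens(text)))
```

Whitespace splitting returns the unspaced Chinese sentence as a single unit, which would almost certainly be out of vocabulary, while the character-level view of the English sentence is roughly three times longer than its word-level view.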

2.2 Byte-Level and Hybrid Approaches

Byte-level methods (e.g., byte-level BPE, or BBPE) ensure lossless, script-agnostic segmentation, but they can split a single character of a complex script across several tokens that carry no meaning on their own, and they inflate sequence length. Hybrid frameworks attempt script-sensitive splits while preserving a byte-level fallback, but struggle with normalization and canonicalization across Unicode variants [jetir].
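As a rough illustration of why byte-level segmentation is lossless yet uneven across scripts, the sketch below encodes single characters as UTF-8 and counts the bytes. The helper names `byte_tokens` and `from_byte_tokens` and the sample characters are assumptions made for this example; a real byte-level tokenizer would additionally learn merges on top of these bytes.

```python
def byte_tokens(text: str) -> list[int]:
    """Treat every UTF-8 byte as one token (no learned merges)."""
    return list(text.encode("utf-8"))


def from_byte_tokens(tokens: list[int]) -> str:
    """Invert byte_tokens, showing that the segmentation is lossless."""
    return bytes(tokens).decode("utf-8")


# One character per script (assumed samples).
samples = {"Latin": "a", "Greek": "α", "Devanagari": "क", "Han": "中"}

for script, ch in samples.items():
    tokens = byte_tokens(ch)
    assert from_byte_tokens(tokens) == ch  # the round trip always succeeds
    print(f"{script}: {len(tokens)} byte token(s)")
```

A Latin letter costs one byte, while a Devanagari or Han character costs three, so any sequence-length budget is consumed faster for those scripts before a single merge has been learned.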

3. Causes of Cross-Language Tokenization Mismatches

3.1 Pretraining Data Imbalance

Large corpora like CommonCrawl or CC100 are skewed toward English and other high-resource Latin-script languages, which biases vocabulary allocation in multilingual models. Even when the total amount of data is vast, non-Latin scripts remain underrepresented [arxiv].
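The effect of data skew on vocabulary allocation can be reproduced with a toy BPE trainer. The sketch below is a simplified version of the algorithm (no end-of-word marker, no pre-tokenization), and the corpus, word frequencies, and merge budget are invented for illustration.

```python
from collections import Counter


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merges, always picking the most frequent pair."""
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Skewed toy corpus (assumed): 100 English word tokens, 2 Hindi word tokens.
corpus = ["token"] * 50 + ["model"] * 50 + ["टोकन"] * 2
merges = train_bpe(corpus, num_merges=8)
print(encode("token", merges))  # English words collapse into single tokens
print(encode("टोकन", merges))   # the Hindi word stays split into code points
```

Because every merge goes to the most frequent pair, the whole merge budget is spent on the English words, and the Hindi word never receives a merge. Real tokenizers have far larger budgets, but the same frequency-driven allocation applies.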

3.2 Script and Orthographic Diversity

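Scripts differ dramatically in unit structure (e.g., conjunct forms in Devanagari), and the languages written in them differ in word formation (e.g., agglutination in Dravidian languages), which challenges one-size-fits-all subword models. Inconsistent Unicode normalization and multiple encodings of ligatures mean that visually identical text can map to different token sequences, further inflating token variance [jetir].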
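The normalization problem can be shown with Python's standard `unicodedata` module. The example uses a Latin accented letter because it is the smallest case; the same mechanism applies to combining marks in other scripts. The variable names are assumptions for this sketch.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# The two strings render identically but differ at the code point and byte level,
# so a tokenizer that does not normalize sees two different inputs.
print(composed == decomposed)
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Applying one normalization form before tokenization removes the variance.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)
```

Which form to use (NFC, NFD, NFKC, NFKD) is a design choice; what matters is that training and inference apply the same one.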

3.3 Morphological Complexity and Agglutination

Agglutinative languages (e.g., Tamil, Telugu, Finnish) produce long concatenated forms that generic subword models over-segment, losing morpheme integrity and semantic cues [jetir].

3.4 Code-Mixing and Transliteration

Informal text often blends languages or transliterates scripts (e.g., Hinglish), further elevating OOV rates and fragmenting semantically coherent units [jetir].

4. Quantifying Mismatches and Impacts

4.1 Fragmentation Disparities

Empirical studies show that non-Latin scripts can require many more tokens than English for the same content, directly inflating inference costs and latency. Controlled experiments with byte-level BPE (BBPE) across scripts confirm persistent fragmentation even when content and vocabulary size are held constant (Figure 3) [aclanthology].
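One simple way to quantify such a disparity is the ratio of token counts for parallel content. The sketch below is an illustration only: `token_premium` is a hypothetical helper, and the raw UTF-8 byte tokenizer is a stand-in for whichever tokenizer is being measured (it is not BBPE, which would also apply learned merges).

```python
from typing import Callable


def token_premium(tokenize: Callable[[str], list], text: str, reference: str) -> float:
    """Tokens needed for `text` relative to a parallel `reference` text."""
    return len(tokenize(text)) / len(tokenize(reference))


def byte_tokenize(text: str) -> list[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# A parallel greeting in Hindi and English (assumed sample pair).
hindi, english = "नमस्ते", "hello"
premium = token_premium(byte_tokenize, hindi, english)
print(f"Hindi needs {premium:.1f}x the tokens of English for this pair")
```

Averaging this ratio over a parallel corpus, rather than a single pair, gives a more stable estimate; passing a different `tokenize` callable lets the same measurement be repeated for any tokenizer.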

4.2 Downstream Task Degradation

Over-fragmentation yields longer input sequences, increasing memory usage and slowing inference. It also dilutes semantic embeddings, reducing performance on tasks like named entity recognition, part-of-speech tagging, and machine translation, especially under zero-shot or low-resource settings [aclanthology+1].

4.3 Economic and Fairness Considerations

Commercial LLM APIs typically charge by token count. Speakers of over-fragmented languages therefore pay more per unit of information conveyed, exacerbating digital divides and accessibility inequities [aclanthology].
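The pricing effect is plain arithmetic, sketched below. The price per 1,000 tokens and both token counts are hypothetical numbers chosen for illustration; they do not describe any real provider or language pair.

```python
def request_cost(num_tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost of a request under simple per-token pricing."""
    return num_tokens / 1000 * usd_per_1k_tokens


# Hypothetical values (assumptions for this sketch).
USD_PER_1K_TOKENS = 0.002
tokens_reference = 1000   # a document in a well-covered language
tokens_fragmented = 3000  # the same content in an over-fragmented language

cost_reference = request_cost(tokens_reference, USD_PER_1K_TOKENS)
cost_fragmented = request_cost(tokens_fragmented, USD_PER_1K_TOKENS)

# With linear pricing, the cost ratio equals the token-count ratio.
cost_ratio = cost_fragmented / cost_reference
print(f"{cost_reference:.4f} USD vs {cost_fragmented:.4f} USD ({cost_ratio:.1f}x)")
```

Because pricing is linear in tokens, any fragmentation ratio measured for a tokenizer carries over directly to the bill, and it also shrinks the amount of content that fits in a fixed context window.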

5. Existing Mitigation Strategies

5.1 Language-Specific Tokenizers

Monolingual or script-specific tokenizers improve segmentation fidelity but require separate vocabularies and complicate model architectures. "Trans-tokenization" approaches initialize the token embeddings of a low-resource vocabulary from those of a high-resource vocabulary, using token alignments derived from parallel corpora with statistical machine translation (SMT) alignment tools [arxiv].
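The general idea of alignment-based initialization can be sketched as a weighted average of source embeddings. This is a simplified illustration under stated assumptions, not the cited method's exact procedure: the function name `init_embedding`, the two-dimensional embeddings, and the alignment counts are all invented for the example.

```python
def init_embedding(
    alignment_counts: dict[str, float],
    source_embeddings: dict[str, list[float]],
) -> list[float]:
    """Initialize a target token as the alignment-weighted mean of source embeddings."""
    total = sum(alignment_counts.values())
    dim = len(next(iter(source_embeddings.values())))
    vector = [0.0] * dim
    for source_token, count in alignment_counts.items():
        weight = count / total  # normalize counts into weights that sum to 1
        for i, value in enumerate(source_embeddings[source_token]):
            vector[i] += weight * value
    return vector


# Toy source embeddings (assumed values).
source_embeddings = {"water": [1.0, 0.0], "aqua": [0.0, 1.0]}

# Hypothetical alignment counts for the target token "पानी",
# as they might come out of a word aligner run on a parallel corpus.
alignment_counts = {"water": 3, "aqua": 1}

new_vector = init_embedding(alignment_counts, source_embeddings)
print(new_vector)
```

Starting from such an initialization, rather than from random vectors, lets fine-tuning on the target language begin with embeddings that already sit near semantically related source tokens.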

5.2 Token Alignment and Vocabulary Transfer

Methods like probabilistic token alignment (TokAlign) and contextual dynamic mapping aim to reconcile subword vocabularies across languages, enabling cross-lingual knowledge transfer and reducing fragmentation-induced performance gaps [openreview+1].

5.3 Tokenizer-Free Models

Character- or byte-level architectures (e.g., the encoder-only CANINE and the encoder-decoder ByT5) avoid subword biases, but they can underperform subword models because they lack morpheme-level abstraction, and they require deep architectures to handle the longer sequences [aclanthology].

5.4 Adaptive and Script-Aware Tokenization

Recent research explores script-agnostic normalization layers, morphological analyzers, and dynamic token boundary inference using self-supervised signals to better capture language-specific constructs [arxiv+1].

6. Open Challenges and Research Gaps

  • Standardized Benchmarks for Tokenization: Unlike GLUE for model evaluation, no widely adopted benchmark exists to measure token segmentation quality, coverage, and cross-lingual fidelity.
  • Annotated Multilingual Corpora: Scarcity of diverse, token-level annotated datasets for low-resource languages hampers tokenizer development and evaluation.
  • Code-Mixed and Informal Text: Robust tokenization strategies for social media, transliterated text, and rapid code-switching remain underexplored.
  • Dialect and Variant Coverage: Tokenizers struggle with dialectal variations and multiple scripts per language (e.g., Sindhi in Devanagari and Perso-Arabic script), calling for multi-script normalization.
  • Computational and Ethical Constraints: Resource-limited institutions face barriers to developing custom tokenizers, while model fragmentation duplicates effort and embeds biases from training corpora.

7. Future Directions

  • Unified, Modular Tokenization Frameworks: Develop plug-and-play tokenization pipelines that dynamically adapt to script and morphological cues with minimal manual rule crafting.
  • Tokenization-Specific Evaluation Suites: Curate multilingual token-level datasets and metrics (semantic preservation, fragmentation rate, coverage) to standardize research progress.
  • Cross-Lingual Embedding Initialization: Expand trans-tokenization and probabilistic alignment to more languages and model families, leveraging multilingual parallel resources.
  • Lightweight Adapters and LoRA Integration: Employ parameter-efficient fine-tuning (e.g., LoRA) for language-specific token embeddings to reduce compute overhead.
  • Community-Driven Corpora Collection: Foster open, annotated corpora efforts across underrepresented languages and dialects, emphasizing inclusive, ethical data sourcing.
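Two of the metrics named in the list above, fragmentation rate and coverage, can be sketched in a few lines. The greedy longest-match tokenizer, the toy vocabulary, and the word list are assumptions made for illustration; an evaluation suite would plug in real tokenizers and multilingual test sets.

```python
UNK = "[UNK]"


def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unmatched characters become UNK."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(UNK)
            i += 1
    return tokens


def fertility(words: list[str], vocab: set[str]) -> float:
    """Fragmentation rate: average number of tokens per word."""
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


def coverage(words: list[str], vocab: set[str]) -> float:
    """Share of produced tokens that are in-vocabulary (not UNK)."""
    tokens = [t for w in words for t in greedy_tokenize(w, vocab)]
    return sum(t != UNK for t in tokens) / len(tokens)


# Toy vocabulary and evaluation words (assumed).
vocab = {"token", "ize", "r", "s"}
words = ["tokenizers", "tokens", "q"]

print(f"fertility = {fertility(words, vocab):.2f}")
print(f"coverage  = {coverage(words, vocab):.2f}")
```

Reporting both numbers per language, on the same parallel content, makes disparities directly comparable across tokenizers.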

Conclusion
Tokenization mismatches across languages represent a critical bottleneck for equitable, efficient, and high-quality multilingual NLP. While subword methods have powered remarkable advances, their one-size-fits-all nature yields fragmentation and bias against low-resource and non-Latin scripts. Addressing these challenges requires a multifaceted approach: rigorous quantification of fragmentation disparities, development of tokenizer-aware benchmarks, creation of script- and morphology-sensitive tokenizers, and cross-lingual embedding transfer techniques. Collaborative efforts to curate annotated corpora, standardize evaluation, and integrate parameter-efficient adapters will drive the next generation of inclusive, fair multilingual LLMs.

  1. https://www.jetir.org/papers/JETIR2504A94.pdf
  2. https://arxiv.org/html/2408.04303v1
  3. https://arxiv.org/html/2502.12560v1
  4. https://aclanthology.org/anthology-files/anthology-files/pdf/emnlp/2023.emnlp-main.614.pdf
  5. https://aclanthology.org/2023.vardial-1.5/
  6. https://openreview.net/pdf?id=ksBhCsSUaE
  7. https://aclanthology.org/2023.findings-eacl.128.pdf
  8. https://arxiv.org/html/2410.12989v1
  9. https://www.mhtechin.com
  10. https://in.linkedin.com/company/mhtechin-india
  11. https://arxiv.org/html/2502.11104v1
  12. https://play.google.com/store/apps/details?id=com.mhtechin.content&hl=en_IN
  13. https://openreview.net/forum?id=sBxvoDhvao&noteId=In2OezBiS9
  14. https://unstop.com/c/mhtechin-892802
  15. https://github.com/haotian-liu/LLaVA/issues/661
  16. https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=6983&context=etd
  17. https://www.instagram.com/mhtechin/?hl=en
  18. https://ozanciga.wordpress.com/2023/10/25/multilingual-tokenization-why-is-it-so-important-and-how-to-make-your-model-sizes-smaller/
  19. https://www.debutinfotech.com/blog/nlp-tokenization-methods-types-tools
  20. https://www.sciencedirect.com/science/article/pii/S2667305324000115
  21. https://arxiv.org/pdf/2403.08688.pdf
  22. https://www.amazon.science/publications/token-alignment-via-character-matching-for-subword-completion
  23. https://arxiv.org/html/2404.11553v1
  24. https://arxiv.org/abs/2506.03523
  25. https://www.ndss-symposium.org/wp-content/uploads/bar2025-final13.pdf
  26. https://aclanthology.org/2024.lrec-main.965.pdf
  27. https://academic.oup.com/bioinformatics/article/40/4/btae196/7645044
  28. https://www.sciencedirect.com/science/article/pii/S1319157821001804
  29. https://aclanthology.org/2025.naacl-short.63.pdf
  30. https://link.springer.com/chapter/10.1007/978-3-031-64451-1_5
  31. https://openreview.net/forum?id=4VmagzA2Tp
  32. https://pmc.ncbi.nlm.nih.gov/articles/PMC11436924/
  33. https://www.marketsandmarkets.com/Market-Reports/tokenization-market-76652221.html
  34. https://arxiv.org/pdf/2402.14903.pdf
  35. https://stackoverflow.com/questions/67567587/python-bert-tokenizer-cannot-be-loaded
  36. https://link.springer.com/article/10.1007/s10791-007-9027-7
  37. https://github.com/qdrant/qdrant/issues/5258
  38. https://arxiv.org/html/2504.04264v1
  39. https://community.openai.com/t/troubleshooting-openais-whisper-model-resolving-incorrect-language-outputs-for-maithili-with-multilanguage-tokenizer/946321
  40. https://link.springer.com/article/10.1007/s10115-025-02520-4
  41. https://arxiv.org/html/2504.07053v1
  42. https://giguete.users.greyc.fr/pricai96/part4.html
  43. https://arxiv.org/html/2504.09378v1
  44. https://github.com/flairNLP/flair/issues/1672
  45. https://discuss.huggingface.co/t/what-to-do-when-huggingface-throws-cant-load-tokenizer/23046
  46. https://www.biorxiv.org/content/10.1101/2024.09.09.612081v2.full-text
  47. https://github.com/NVIDIA/NeMo/issues/6661
  48. https://www.newline.co/@zaoyang/cross-lingual-fine-tuning-key-techniques--e122c0bf
  49. https://www.spreedly.com/blog/card-tokenization-failures