This document discusses the rapid evolution of multilingual AI in text and voice processing, driven by Large Language Models (LLMs) and multimodal approaches. It highlights advances in cross-lingual natural language understanding (NLU), translation, automatic speech recognition (ASR), and text-to-speech (TTS), showcasing models like Google’s PaLM 2 and Gemini, OpenAI’s GPT-4 and Whisper, and Meta’s NLLB-200 and SeamlessM4T.

Introduction: The Global Language Landscape

In recent years, multilingual AI has evolved rapidly, transforming how organizations approach translation, Natural Language Understanding (NLU), and voice processing. As the world becomes increasingly interconnected, the ability of technology to understand and interact with diverse languages, in both text and voice, is crucial. This article presents an expert-level perspective, highlighting recent innovations, open challenges, and a comparative analysis of proprietary and open-source models. We summarize key research and application highlights in this space from the last couple of years.

State-of-the-Art: The Multilingual Space

Recent years have witnessed a revolution in NLP, driven by Large Language Models (LLMs) and multimodal approaches. These models have significantly improved the ability to process and generate text and speech in multiple languages. Let us review some advances in detail.

Multilingual Text

Advances in cross-lingual NLU have enabled AI systems to understand and generate text across dozens of languages in a unified model. Google’s PaLM 2 and the newer Gemini models are trained on over 100 languages and demonstrate strong multilingual understanding on tasks like reading comprehension and reasoning (ai.google). OpenAI’s GPT-4 (and other OpenAI models) is inherently multilingual; for example, GPT-4 can answer questions or summarize documents in many languages with high accuracy, making it a versatile NLU system for global use.

These models benefit from transfer learning: knowledge learned from high-resource languages (like English) can improve performance on low-resource languages, enabling zero-shot or few-shot understanding of languages the model saw little of during training. Despite this progress, current state-of-the-art NLU models still perform best on languages with abundant training data – true “any-language” understanding remains an ongoing challenge.

In practice, big tech companies now incorporate multilingual NLU into their products: Google’s MUM model (75+ languages) enhances Search by transferring knowledge across languages, and Microsoft’s Z-Code models (Mixture-of-Experts) boost multilingual understanding in Office and Azure services. LLMs like GPT-4, and earlier smaller LMs like BERT, have demonstrated remarkable cross-lingual transfer learning, performing well in languages they were not explicitly trained on. Similarly, Meta’s XLM-R, trained with self-supervised learning on 100 languages, achieved state-of-the-art results on cross-lingual benchmarks (e.g., XNLI). Multimodal models further enhance this by integrating visual and auditory data, allowing for more nuanced understanding and interaction.
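
To make this concrete, below is a minimal sketch of zero-shot cross-lingual understanding, assuming the Hugging Face transformers library and a community XNLI-fine-tuned XLM-R checkpoint; the model name, example sentence, and labels are illustrative assumptions, not drawn from the sources above.

```python
# A minimal sketch of cross-lingual transfer, assuming `transformers` and the
# community XNLI-finetuned checkpoint "joeddav/xlm-roberta-large-xnli".
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")

# The model was fine-tuned on XNLI, yet can label text in languages it saw
# little labeled data for, e.g. Swahili ("I am looking for a cheap new phone").
result = classifier("Ninatafuta simu mpya ya bei nafuu",
                    candidate_labels=["shopping", "sports", "politics"])
print(result["labels"][0], round(result["scores"][0], 3))
```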

One key application of combined NLU and NLG capability in the multilingual space is translation. Google Translate, now powered by Google’s PaLM 2 large language model, expanded to nearly 250 supported languages in 2024 (blog.google). This expansion (adding 110 new languages, including many African and indigenous languages) leveraged AI to cover an additional 614 million speakers worldwide. Meta AI similarly pushed the frontier with its No Language Left Behind (NLLB-200) model, supporting 200 languages with state-of-the-art translation quality. It introduced a sparsely gated Mixture-of-Experts architecture to balance cross-lingual transfer against interference, markedly improving low-resource language translation (Nature). Some long-standing problems in this space have also been addressed in the past couple of years: for example, PaLM 2 can now translate idiomatic and poetic phrases more adeptly than prior models, a capability that previously required significant fine-tuning of smaller language models.
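
As an illustration of the open-source side, the sketch below runs the distilled NLLB-200 checkpoint through Hugging Face transformers; the checkpoint name, FLORES-200 language codes, and example sentence are our assumptions rather than content from the sources above.

```python
# A hedged sketch of low-resource translation with the open NLLB-200
# distilled checkpoint via Hugging Face `transformers`.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tokenizer("No language should be left behind.", return_tensors="pt")
# NLLB uses FLORES-200 codes; force the decoder to start in Swahili.
tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("swh_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
```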

Multilingual Voice

Speech AI has seen major breakthroughs in multilingual capability in the last few years. In Automatic Speech Recognition (ASR), OpenAI’s Whisper model set a new standard by transcribing speech in nearly 100 languages with high accuracy, performing well even on accented or noisy speech. Meta’s Massively Multilingual Speech (MMS) project expanded speech-to-text and text-to-speech from roughly 100 languages to over 1,100 in a single model, and can identify more than 4,000 spoken languages and dialects. This was achieved by leveraging audio recordings of translated religious texts to gather training data (see Meta’s AI blog). Amazon Lex and Amazon Transcribe now also support more than 100 languages, with noticeable accuracy improvements over foundational models in challenging acoustic environments.
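
A minimal sketch of multilingual ASR with the open-source whisper package follows; the model size and the audio filename are placeholder assumptions.

```python
# A minimal sketch using the open-source `openai-whisper` package
# (pip install openai-whisper); "meeting.mp3" is a placeholder file.
import whisper

model = whisper.load_model("small")        # multilingual checkpoint
result = model.transcribe("meeting.mp3")   # source language is auto-detected
print(result["language"], result["text"])

# Whisper can also translate non-English speech, but only into English:
english = model.transcribe("meeting.mp3", task="translate")
print(english["text"])
```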

In Text-To-Speech (TTS), cloud providers now offer high-quality neural voices in dozens of languages (e.g., Microsoft Azure, Amazon Polly, and Google Cloud each support 100+ languages/variants with natural-sounding TTS). Open-source efforts like Coqui’s TTS library have also enabled multilingual voice synthesis and cloning, allowing a user’s voice to be cloned and used to speak other languages.
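
For instance, a hedged sketch of cross-lingual voice cloning with Coqui’s library might look like the following; the XTTS v2 model name and the reference audio clip are assumptions.

```python
# A sketch of cross-lingual voice cloning with Coqui TTS (pip install TTS);
# the model name and "my_voice.wav" reference clip are assumptions.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Bonjour, ceci est un test de clonage de voix.",
    speaker_wav="my_voice.wav",   # a short sample of the voice to clone
    language="fr",                # speak French in that cloned voice
    file_path="cloned_fr.wav",
)
```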

Speech-to-speech translation – translating spoken input in one language to spoken output in another – is becoming a reality. Meta’s SeamlessM4T (2023) is the first all-in-one multimodal translation model: it performs speech recognition, speech-to-text translation, text-to-text translation, and even direct speech-to-speech translation for nearly 100 languages in a single system. By handling multiple modalities together, SeamlessM4T reduces cascading errors and latency, moving closer to real-time voice translation. Google and others have demonstrated research prototypes like Translatotron, which translates speech directly to speech while preserving the original speaker’s voice characteristics. In summary, the state of the art in multilingual voice AI now enables transcribing, generating, and translating speech across many languages.
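
Below is a sketch of how such a unified model can be driven from Hugging Face transformers, based on the published SeamlessM4T model card; the checkpoint name and the exact generate() behavior are assumptions rather than Meta’s reference pipeline.

```python
# A hedged sketch of text-to-speech translation with SeamlessM4T via
# `transformers`; checkpoint name and generate() usage are assumptions
# taken from the published model card, not this article's sources.
import scipy.io.wavfile
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# One model handles text and speech in, and text and speech out.
inputs = processor(text="Where is the nearest train station?",
                   src_lang="eng", return_tensors="pt")
audio = model.generate(**inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()
scipy.io.wavfile.write("question_fr.wav", model.config.sampling_rate, audio)
```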

Multilingual NLP and Voice: A Comparative Snapshot

Below we present comparative tables for multilingual natural language understanding (NLU), multilingual translation (text-based MT), and multilingual voice (speech recognition, synthesis, and speech translation). Each table highlights leading proprietary and open-source models/providers, their language coverage, key strengths, limitations, and common use cases.

Multilingual NLU

| Model/Provider | Language Coverage | Key Strengths | Limitations | Common Use Cases |
|---|---|---|---|---|
| OpenAI GPT-4 (Proprietary) | ~26 deeply tested | Top-tier reasoning; strong in low-resource languages | Closed model; fixed knowledge; resource-heavy | Multilingual chatbots, cross-lingual Q&A, sentiment analysis |
| Google PaLM 2 (Proprietary) | 100+ languages | Nuanced text understanding; efficient sizes | API-only; variable quality in niche languages | Bard, Workspace language tools, Google Search |
| Microsoft Z-Code (Proprietary) | Dozens live, hundreds in development | Efficient MoE architecture; top-tier in NLU and MT | Internal-only; high compute needs | Azure Translator, Office features |
| Meta XLM-R (Open Source) | 100 languages | Strong cross-lingual understanding; excels in low-resource | Encoder-only; needs fine-tuning; older data | Text classification, NER, semantic search |

Translation

| Model/Provider | Language Coverage | Key Strengths | Limitations | Common Use Cases |
|---|---|---|---|---|
| Google Translate (Proprietary) | 249 languages | Unmatched breadth; good general quality | Quality varies; English-pivoted; lacks context | Web/app translations, localization drafts |
| Microsoft Translator (Proprietary) | 100+ languages | Strong many-to-many MT; Azure-ready | Closed; limited coverage vs. Google | Office/Teams translation, enterprise use |
| Meta NLLB-200 (Open Source) | 200 languages | Top for low-resource MT; 40k directions | Large model; non-commercial license | NGOs, research, underserved language support |
| DeepL Translator (Proprietary) | 31 languages | High fluency; preferred in professional settings | Limited scope; paid tier | Legal/business translation, high-stakes content |

Multilingual Voice

| Model/Provider | Language Coverage | Key Strengths | Limitations | Common Use Cases |
|---|---|---|---|---|
| Google USM / WaveNet TTS (Proprietary) | ASR: 100+ / TTS: 40+ | State-of-the-art ASR/TTS; natural voices | Cloud-only; variable low-resource TTS | YouTube captions, Assistant, Cloud APIs |
| Microsoft Azure Speech (Proprietary) | ASR: ~100 / TTS: 140+ | Customizable, accurate ASR/TTS; multilingual personas | Closed; cost and quality vary | Teams transcripts, IVRs, content narration |
| Amazon Polly / Lex / Transcribe (Proprietary) | ASR: 100+ / TTS: a dozen languages | State-of-the-art ASR/TTS; natural voices | Cloud-only; TTS available in multiple quality tiers | Amazon Alexa |
| OpenAI Whisper (Open Source) | 99 languages | Strong out-of-the-box ASR; speech-to-English translation | Translation output is English-only; slow on CPU | Subtitling, voice input apps, audio archives |
| Coqui TTS (Open Source) | 20+ supported, 1000+ possible | Flexible TTS with voice cloning; lightweight | Limited polish; community-supported | Voice cloning, offline TTS, embedded systems |

Open Challenges in Multilingual NLP and Voice

Despite advancements, several challenges remain:

Low-Resource Languages and Data Scarcity

Quality drops significantly for languages with little training data. Many of the world’s 7,000+ languages lack large text or speech corpora, making it hard for AI to learn them. Efforts like NLLB and MMS have begun tackling this via transfer learning and novel data mining, but performance and coverage for low-resource tongues (e.g., many African, indigenous, or oral-only languages) still lag far behind major languages.

Context and Cultural Nuance

Most translation systems still translate sentence by sentence, with little memory of prior context. This leads to errors with pronouns, formality, or domain-specific terms that require context to resolve. Models need mechanisms for long-term memory or discourse awareness to handle multi-sentence context effectively. These issues become especially acute for long-form content like books and novels, where context continuity spans much longer ranges and can be multi-threaded. Similarly, NLU tasks like cross-lingual coreference or sentiment analysis can fail if cultural and contextual subtleties aren’t understood.

Real-Time translation and scalability

There is a gap between high-quality translation and real-time responsiveness. Ongoing work uses adaptive chunking and anticipation to narrow it, but true real-time, flowing translation still has room for improvement. Reducing inference latency for huge models is also a practical scalability issue: the most advanced multilingual models are extremely large (billions of parameters) and resource-intensive, which poses challenges for real-time deployment.
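
As a purely illustrative sketch (not any production system’s algorithm), adaptive chunking can be as simple as buffering a partial transcript and flushing early at clause boundaries or late at a hard size cap, trading latency against context; the `translate` callable below is a hypothetical stand-in for any MT backend.

```python
# A hypothetical sketch of adaptive chunking for streaming translation.
from typing import Callable, Iterable, Iterator

def adaptive_chunks(words: Iterable[str], max_len: int = 12) -> Iterator[str]:
    buffer: list[str] = []
    for word in words:
        buffer.append(word)
        at_boundary = word.endswith((".", "!", "?", ","))
        # Flush early at a clause boundary (more fluency, lower latency),
        # or late at the hard cap (bounded worst-case delay).
        if (at_boundary and len(buffer) >= 4) or len(buffer) >= max_len:
            yield " ".join(buffer)
            buffer = []
    if buffer:
        yield " ".join(buffer)

def stream_translate(words: Iterable[str], translate: Callable[[str], str]) -> None:
    for chunk in adaptive_chunks(words):
        print(translate(chunk))  # emit translation as soon as a chunk closes
```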

Evaluation and Benchmark Gaps

New multilingual benchmarks (e.g., FLORES-200 for translation) have appeared, but automatic metrics still fall short of human evaluation, which in turn doesn’t scale well. Ensuring fairness and avoiding bias in multilingual AI (e.g., offensive mistranslations or poorer results for certain dialects) is also an ongoing concern.
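
A minimal example of the automatic-metric side of this trade-off using the sacrebleu library follows; the sentences are placeholders standing in for real system outputs and FLORES-200 reference translations.

```python
# A minimal sketch of automatic MT evaluation (pip install sacrebleu);
# hypotheses/references here are placeholders for real evaluation data.
import sacrebleu

hypotheses = ["The cat sits on the mat."]
references = [["The cat is sitting on the mat."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```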

Experiential Learning: Multilingual Work at Sahaj

At Sahaj, our multilingual AI work spans several significant client engagements. For an automobile-sector client, we’ve built sophisticated conversational engines that support multiple low-resource Indic languages with robust Natural Language Understanding capabilities, enabling seamless transitions between languages and contexts. This solution includes advanced multilingual fine-tuning tailored to varied conversational use cases. For the same client, we are also developing a voice-based system across ten Indic languages, integrated as a personalized voice interface layered atop a multi-agent search ecosystem.

Another notable engagement involves a mission-driven organization operating in the agricultural domain. Here, we’ve created a voice-enabled chat application designed for low-resource languages such as Swahili and Hindi. The solution leverages efficient speech-to-text and translation capabilities to power a Retrieval-Augmented Generation (RAG) engine over a curated, multilingual knowledge base. Underlying multilingual embedding models cohesively represent this knowledge base, which integrates geographical insights and linguistic contexts, effectively unifying local languages with English.

Within our Sahaj AI labs, we’re actively advancing research in multilingual NLU across diverse linguistic tasks. One line of work builds robust frameworks that recommend optimal multilingual models for specific NLU applications and demonstrates the effectiveness of targeted fine-tuning techniques (Paper1, Paper2). Concurrently, we’re developing specialized voice transcription capabilities optimized for heavily accented English speech with colloquial vocabulary and varied speaking tonalities. Additionally, we have developed intelligence for automatic spoken language understanding and speech recognition for conversational assistants under challenging speaking modes and acoustic environments.

Recommendations and Takeaways

Multilingual AI capabilities in text and voice have advanced to the point where global applications are not only feasible but increasingly common. How to leverage these capabilities is a strategic decision. Below are a few recommendations for building multilingual capability into AI systems, whether text or voice:

Leverage Foundation Models, but Fine-Tune for Specific Needs: Large multilingual models (from GPT-4 to NLLB) provide an excellent base. Investing in fine-tuning or prompting these models for your domain or use case – for example, fine-tuning a multilingual NLU model on your customer support data in multiple languages – can improve accuracy (a minimal fine-tuning sketch follows this list). This hybrid approach balances the breadth of foundation models with the depth of domain expertise.

Address Low-Resource Languages: Make proactive efforts for low-resource languages, such as curating data or engaging human linguists to augment the AI, and then leverage transfer learning to adapt pre-trained LLMs to the specific languages and tasks.

Operational Efficiency for Scale: Design considerations – real-time vs. batch processing, open source vs. proprietary, larger models vs. SLMs – need to be evaluated for comprehensive scale and efficiency.
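
The fine-tuning sketch referenced above, under stated assumptions: XLM-R as the base model, a hypothetical tickets.csv of multilingual support messages with text and integer label columns, and the Hugging Face Trainer API.

```python
# A sketch, not a production recipe: fine-tune XLM-R on multilingual
# support tickets. "tickets.csv" (columns: text, label) is hypothetical.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=5)

dataset = load_dataset("csv", data_files="tickets.csv")["train"]
dataset = dataset.train_test_split(test_size=0.1)
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-support", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding of batches
)
trainer.train()
```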

The pace of multilingual AI research is rapid. Breakthroughs like unified multimodal models (e.g., SeamlessM4T) hint at a future of seamless cross-lingual communication. Strategic planning should include experimentation with such new models.

Note: Thanks to Dr. Karthika and Dr. Ravindra for sharing their inputs and providing detailed feedback.