The Quantum Dispatch

Microsoft Launches MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2

Microsoft unveiled MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 on April 2 — three new in-house foundation models now available in Microsoft Foundry.

Dr. Nova Chen · Apr 6, 2026 · 5 min read

Microsoft's MAI Superintelligence Team Steps Forward

On April 2, 2026, Microsoft's in-house AI research unit — the MAI Superintelligence team, formed in November 2025 under the leadership of Mustafa Suleyman, CEO of Microsoft AI — unveiled three new foundation models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. All three are available immediately via Microsoft Foundry and the MAI Playground.

The announcement marks a meaningful expansion of Microsoft's own model portfolio, complementing rather than replacing the company's longstanding partnership with OpenAI. What makes these models notable is not just their capabilities, but the depth of engineering investment they represent — each is designed to compete at the top of its category while offering cost advantages that make them attractive at enterprise scale.

MAI-Transcribe-1: Enterprise-Grade Speech Recognition

MAI-Transcribe-1 is a state-of-the-art speech-to-text model covering the 25 most widely spoken languages globally, benchmarked against the industry-standard FLEURS evaluation suite. The headline performance metric is compelling: batch transcription throughput runs 2.5x faster than Microsoft's existing Azure Fast offering, while maintaining accuracy that places it among the strongest models in its category.

For organizations running high-volume transcription workloads — call center analytics, meeting intelligence, broadcast captioning, accessibility tooling — the combination of multilingual coverage, accuracy, and speed represents a meaningful upgrade. Pricing starts at $0.36 per hour of audio transcribed.
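At that rate, back-of-envelope budgeting is straightforward. A minimal sketch, using the announced $0.36/hour figure; the workload volume below is an illustrative assumption, not a real customer figure:

```python
# Estimated monthly cost for a transcription workload at MAI-Transcribe-1's
# announced rate of $0.36 per hour of audio transcribed.
RATE_PER_AUDIO_HOUR = 0.36  # USD, per the April 2 announcement

def transcription_cost(audio_hours: float) -> float:
    """Return the estimated transcription cost in USD."""
    return audio_hours * RATE_PER_AUDIO_HOUR

# A call center producing 10,000 hours of recordings per month:
monthly = transcription_cost(10_000)
print(f"${monthly:,.2f}")  # → $3,600.00
```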

MAI-Voice-1: Natural, Emotionally Rich Voice Generation

MAI-Voice-1 targets the voice generation market with a model built to produce speech that preserves speaker identity across long-form content. The model delivers natural prosody, emotional range, and expression control that allow generated voices to remain coherent and distinctive throughout extended passages — a capability gap that has historically challenged voice synthesis models in long-form applications.

The practical applications are broad: content creators building narration pipelines, accessibility tools converting text to speech, interactive voice applications, and enterprise platforms embedding spoken interfaces. At $22 per million characters, the pricing is competitive for production deployments. MAI-Voice-1 is already powering voice output in Microsoft Copilot products.
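For long-form narration, the $22-per-million-characters rate translates to modest per-title costs. A quick sketch; the manuscript length and characters-per-word ratio are illustrative assumptions:

```python
# Estimated cost of narrating a book-length manuscript with MAI-Voice-1
# at the announced rate of $22 per million characters.
RATE_PER_MILLION_CHARS = 22.0  # USD, per the April 2 announcement

def voice_cost(characters: int) -> float:
    """Return the estimated synthesis cost in USD."""
    return characters / 1_000_000 * RATE_PER_MILLION_CHARS

# A 90,000-word manuscript at an assumed ~6 characters per word:
chars = 90_000 * 6  # 540,000 characters
print(f"${voice_cost(chars):.2f}")  # → $11.88
```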

MAI-Image-2: Top-3 Image Generation With 2x Speed

MAI-Image-2 has achieved a top-3 ranking on the Arena.ai image generation leaderboard and delivers generation speeds at least 2x faster than its predecessor in Microsoft Foundry and Copilot. For developers building image-generation workflows or creative professionals using Copilot, the combination of quality and speed is the most significant upgrade yet in Microsoft's image AI stack.

Pricing is structured at $5 per million text input tokens and $33 per million image output tokens — competitive with other frontier image generation services.
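Because image pricing splits across text input and image output tokens, a batch estimate has two terms. A sketch using the announced rates; the per-image token counts are illustrative assumptions, since the announcement does not specify them:

```python
# Estimated cost of an MAI-Image-2 batch at the announced token rates:
# $5 per million text input tokens, $33 per million image output tokens.
TEXT_IN_PER_M = 5.0     # USD per million text input tokens
IMAGE_OUT_PER_M = 33.0  # USD per million image output tokens

def image_batch_cost(n_images: int, prompt_tokens: int, image_tokens: int) -> float:
    """Estimated USD cost for n_images, each with the given token counts."""
    text = n_images * prompt_tokens / 1_000_000 * TEXT_IN_PER_M
    image = n_images * image_tokens / 1_000_000 * IMAGE_OUT_PER_M
    return text + image

# 1,000 images, assuming ~50 prompt tokens and ~1,000 output tokens each:
print(f"${image_batch_cost(1_000, 50, 1_000):.2f}")  # → $33.25
```

As the example suggests, output tokens dominate the bill at these rates, so batch budgets are driven almost entirely by image count rather than prompt length.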

The Strategic Picture

The three models were developed by Microsoft's dedicated MAI Superintelligence team, separate from the teams building on OpenAI's models. This organizational structure gives Microsoft an increasing ability to develop first-party AI capabilities without depending on any single external partner — a strategic diversification play as AI becomes a core infrastructure layer for every Microsoft product.

For enterprise developers and architects evaluating their AI infrastructure: all three models are available now via the Foundry API. Organizations running Azure AI workloads can adopt MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 directly through the platforms they already use.
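As a rough sketch of what adoption looks like in practice: the request below is hypothetical — the endpoint URL, header name, and payload fields are placeholders, not the documented Foundry API — shown only to illustrate how a first-party model slots into an Azure-style REST workflow. Consult the Microsoft Foundry documentation for the actual API shape.

```python
import json
import urllib.request

# Hypothetical sketch of calling MAI-Image-2 through a Foundry-style REST
# endpoint. URL, deployment path, auth header, and payload fields are all
# placeholders for illustration.
ENDPOINT = "https://example-resource.services.ai.azure.com/models/mai-image-2:generate"
API_KEY = "<your-api-key>"

payload = {
    "prompt": "An isometric illustration of a data center at dusk",
    "size": "1024x1024",  # placeholder parameter name
    "n": 1,
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "api-key": API_KEY,  # placeholder header name
    },
    method="POST",
)
# response = urllib.request.urlopen(request)  # not executed in this sketch
```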

Sources: Microsoft AI Blog (April 2, 2026), TechCrunch (April 2, 2026), VentureBeat (April 2, 2026), GeekWire (April 2, 2026)