New AI Tools

Qwen2.5-Omni


Introduction:

Qwen2.5-Omni, a brand-new multimodal large model open-sourced by Alibaba

Qwen2.5-Omni is the latest flagship multimodal model released by Alibaba's Qwen team, and the first end-to-end multimodal large model in the Qwen series.

Core Capabilities

  • Multimodal Perception and Generation: Qwen2.5-Omni can understand and process multiple input modalities, including text, images, audio, and video, and can generate both text and natural synthesized speech as streaming output in real time. This lets it perceive the world in a way close to human multisensory perception and interact in real time.

  • Real-time Audio and Video Interaction: The architecture of Qwen2.5-Omni supports fully real-time interaction, with chunked input and immediate output.

  • Natural and Robust Speech Generation: In speech generation, Qwen2.5-Omni surpasses many existing streaming and non-streaming alternatives in naturalness and robustness, with evaluation scores that even match human level.

  • Outstanding Cross-modal Performance: When benchmarked against similarly sized unimodal models, Qwen2.5-Omni exhibits outstanding performance across all modalities. For example, it outperforms the similarly sized Qwen2-Audio in audio capabilities and performs on par with Qwen2.5-VL-7B.

  • Excellent End-to-End Speech Command Following Capability: Qwen2.5-Omni follows spoken instructions end to end about as well as it follows text instructions, performing strongly on benchmarks such as MMLU for general knowledge understanding and GSM8K for mathematical reasoning.

  • Innovative Technology: The model adopts a novel positional embedding, Time-aligned Multimodal RoPE (TMRoPE), to synchronize video input timestamps with audio (a toy sketch of the idea follows this list).
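
To make the time-alignment idea concrete, the sketch below shows how tokens from different modalities could be mapped onto one shared temporal axis, so that a video frame and the audio around the same moment receive matching position ids. This is only an illustrative sketch of the concept, not the official TMRoPE implementation; the function name and the 40 ms granularity are assumptions made for this example.

```python
# Toy illustration of time-aligned temporal positions: tokens from different
# modalities that occur at (nearly) the same moment share a temporal index,
# so interleaved audio and video streams stay synchronized.
# NOT the official TMRoPE implementation; names and the 40 ms step are
# assumptions made for this sketch.

def temporal_positions(timestamps_ms, step_ms=40):
    """Map per-token timestamps (milliseconds) to integer temporal position ids."""
    return [t // step_ms for t in timestamps_ms]

# Video frames sampled every 500 ms and audio tokens every 40 ms over 1 second:
video_ts = [0, 500, 1000]
audio_ts = list(range(0, 1001, 40))

print(temporal_positions(video_ts))   # [0, 12, 25]
print(temporal_positions(audio_ts))   # [0, 1, 2, ..., 25]
# Both streams land on the same temporal axis, so the frame at 500 ms and the
# audio tokens around 500 ms occupy matching (or adjacent) positions.
```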

Qwen2.5-Omni has demonstrated strong performance in multiple benchmark tests.

  • Multimodal Fusion Tasks (OmniBench): Qwen2.5-Omni achieves state-of-the-art (SOTA) results on OmniBench and other authoritative multimodal fusion evaluations, performing far better than Google's Gemini-1.5-Pro and other comparable models.
  • Unimodal Tasks: Qwen2.5-Omni also performs excellently across individual modalities:
      • Speech Recognition (ASR): It performs well on datasets such as Common Voice. For example, on the English and Chinese subsets of Common Voice 15, Qwen2.5-Omni-7B outperforms Qwen2-Audio and Whisper-large-v3.
      • Speech Translation (S2TT): It demonstrates strong translation capabilities on datasets such as CoVoST2, even surpassing Qwen2-Audio in some languages.
      • Audio Understanding (Audio Reasoning): On the MMAU benchmark, Qwen2.5-Omni-7B significantly outperforms Qwen2-Audio and Gemini-1.5-Pro in average scores across sound, music, and speech.
      • Image Reasoning: It performs well on benchmarks such as MMMU and MMStar, achieving the best results on MMStar.
      • Video Understanding: It demonstrates strong video understanding on benchmarks such as MVBench, achieving the best results on MVBench.
      • Speech Generation: It performs excellently on Seed-tts-eval and in subjective naturalness evaluations, and leads in content consistency and speaker similarity.
      • Text Capabilities: It performs strongly on text benchmarks such as MMLU, GSM8K, and HumanEval. Although it may slightly underperform specialized text models on some text tasks, this is very impressive given its multimodal capabilities.

Qwen2.5-Omni is an end-to-end model that achieves globally leading multimodal performance at a 7B size. It combines powerful multimodal understanding and generation with excellent real-time interaction, giving developers and enterprises strong AI capabilities across a wide range of application scenarios.
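
For developers who want to try the model, the sketch below shows roughly how the open-sourced 7B checkpoint can be loaded and run through Hugging Face Transformers. The class names (`Qwen2_5OmniForConditionalGeneration`, `Qwen2_5OmniProcessor`) and the `process_mm_info` helper from the `qwen_omni_utils` package follow the public release notes, but exact names and arguments may differ across transformers versions, so treat this as an assumption-laden sketch and verify it against the model card before use.

```python
# Minimal sketch of running Qwen2.5-Omni-7B via Hugging Face Transformers.
# Class names and the qwen_omni_utils helper follow the official release
# notes but may differ across transformers versions; check the model card.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# A single-turn conversation mixing video and text (local path or URL;
# "example_video.mp4" is a placeholder).
conversation = [
    {"role": "user", "content": [
        {"type": "video", "video": "example_video.mp4"},
        {"type": "text", "text": "Describe what is happening in this clip."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True).to(model.device)

# The model returns text token ids plus a synthesized speech waveform.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```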