Tongyi Qianwen Open-Sources a New-Generation End-to-End Multimodal Flagship Model

Recently, Alibaba Cloud officially released Qwen2.5-Omni-7B, a new-generation end-to-end multimodal flagship model. It is reportedly the first end-to-end omni-modal large model in the Tongyi series: it can seamlessly process text, image, audio, and video inputs at the same time, and generate both text and natural synthesized speech through real-time streaming responses.

According to information released by Alibaba Cloud, Qwen2.5-Omni-7B delivers leading omni-modal performance among models of the same scale across a series of benchmarks. Its scores in speech understanding, image understanding, video understanding, and speech generation all surpass those of specialized audio or vision-language models, and its speech generation score of 4.51 is on par with human performance.

Moreover, on the multimodal fusion benchmark OmniBench, Qwen2.5-Omni-7B set a new industry record, outperforming comparable models such as Google's Gemini-1.5-Pro across all dimensions.

The high performance of Qwen2.5-Omni-7B is attributed to a series of breakthrough innovations, including the Thinker-Talker dual-core architecture pioneered by Alibaba Cloud's Tongyi team and the time-aligned multimodal position encoding algorithm TMRoPE (Time-aligned Multimodal RoPE), which synchronizes audio and video inputs. The Thinker-Talker dual-core architecture gives Qwen2.5-Omni-7B a human-like "brain" and "voice generator" within a single end-to-end unified model, enabling efficient coordination between real-time semantic understanding and speech generation.
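As a rough conceptual illustration only (not Alibaba Cloud's actual implementation), the division of labor described above can be pictured as two cooperating modules: a "Thinker" that streams out a text response and a "Talker" that turns each emitted chunk into speech as it arrives. All class and function names below are hypothetical.

```python
# Conceptual toy sketch of a Thinker-Talker style split (hypothetical names, not the real model).
from typing import Iterator


class Thinker:
    """Stands in for the 'brain': consumes input and streams out text chunks."""

    def respond(self, prompt: str) -> Iterator[str]:
        # A real model would run autoregressive decoding; here we just split an echo of the prompt.
        for word in f"Echoing: {prompt}".split():
            yield word


class Talker:
    """Stands in for the 'voice generator': converts streamed text into audio chunks."""

    def speak(self, text_chunk: str) -> bytes:
        # A real system would synthesize waveform samples; here we return placeholder bytes.
        return text_chunk.encode("utf-8")


def stream_reply(prompt: str) -> Iterator[tuple[str, bytes]]:
    """End-to-end streaming: text and speech are produced chunk by chunk, in step."""
    thinker, talker = Thinker(), Talker()
    for chunk in thinker.respond(prompt):
        yield chunk, talker.speak(chunk)


if __name__ == "__main__":
    for text, audio in stream_reply("Hello, Qwen2.5-Omni"):
        print(text, len(audio), "bytes of audio")
```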

At present, Qwen2.5-Omni-7B is open source on Hugging Face, ModelScope, DashScope, and GitHub, and developers and enterprises can download it free of charge. Notably, the model can also be deployed and run on smart terminal hardware such as mobile phones. Alibaba Cloud said that, compared with closed-source large models with hundreds of billions of parameters, Qwen2.5-Omni's compact 7B size makes it feasible to apply omni-modal large models widely in industry.
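For reference, a minimal sketch of how a developer might fetch the open-source weights from Hugging Face using the `huggingface_hub` library; the repository ID `Qwen/Qwen2.5-Omni-7B` and the local directory are assumptions based on the release described above.

```python
# Minimal sketch: downloading the open-source Qwen2.5-Omni-7B weights from Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Qwen/Qwen2.5-Omni-7B",   # assumed Hugging Face repository name
    local_dir="./qwen2.5-omni-7b",    # where to place the downloaded files
)
print(f"Model files downloaded to: {local_dir}")
```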

According to public information, since 2023 the Alibaba Cloud Tongyi team has developed more than 200 "full-size" large models with parameter counts including 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B, and 110B, spanning text generation, visual understanding/generation, speech understanding/generation, and video models.
