DeepSeek has unveiled its latest AI model, Janus-Pro-7B, which it claims outperforms OpenAI’s DALL-E 3 and Stability AI’s Stable Diffusion 3 Medium in text-to-image generation tasks.
This announcement, made in a report titled “Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling,” highlights the model’s advancements in multimodal understanding and generation capabilities.
Janus-Pro-7B’s performance has been validated across multiple benchmarks, showcasing its capabilities in both multimodal understanding and text-to-image generation.
On the GenEval leaderboard for text-to-image instruction-following tasks, Janus-Pro-7B achieved a score of 0.80, surpassing Janus (0.61), OpenAI’s DALL-E 3 (0.67), and Stable Diffusion 3 Medium (0.74).
“Janus-Pro-7B achieved a score of 79.2 on the multimodal understanding benchmark MMBench and 0.80 on the GenEval leaderboard, outperforming state-of-the-art unified multimodal models, including DALL-E 3 and Stable Diffusion 3 Medium,” DeepSeek stated.
The model scored 79.2 on the multimodal understanding benchmark MMBench, significantly outperforming competitors such as Janus (69.4), TokenFlow (68.9), and MetaMorph (75.2).
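For readers who want the gaps side by side, the short Python sketch below computes how far Janus-Pro-7B leads each named competitor. It uses only the scores quoted in this article, which have not been independently re-verified here. The `margins` helper and the dictionary names are illustrative, not taken from DeepSeek's report.

```python
# Scores as quoted in this article (not independently re-verified).
GENEVAL = {
    "Janus-Pro-7B": 0.80,
    "Stable Diffusion 3 Medium": 0.74,
    "DALL-E 3": 0.67,
    "Janus": 0.61,
}
MMBENCH = {
    "Janus-Pro-7B": 79.2,
    "MetaMorph": 75.2,
    "Janus": 69.4,
    "TokenFlow": 68.9,
}


def margins(scores, reference="Janus-Pro-7B"):
    """Return {model: reference score minus model score}, rounded to 2 places."""
    ref = scores[reference]
    return {
        name: round(ref - value, 2)
        for name, value in scores.items()
        if name != reference
    }


if __name__ == "__main__":
    for label, table in (("GenEval", GENEVAL), ("MMBench", MMBENCH)):
        print(label)
        # List the closest competitor first.
        for name, gap in sorted(margins(table).items(), key=lambda kv: kv[1]):
            print(f"  {name}: Janus-Pro-7B leads by {gap}")
```

The two benchmarks use different scales (GenEval is reported from 0 to 1, MMBench in points), so the margins are only comparable within each benchmark.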
Addressing previous shortcomings
The Janus-Pro-7B model builds on the foundation laid by its predecessor, Janus, by addressing critical challenges in visual encoding and generation tasks. Its training data combines 72 million high-quality synthetic images with real-world data to improve the quality of generated images.
- According to the report, earlier unified models struggled with the conflicting demands of multimodal understanding and generation. The Janus family resolves this by decoupling visual encoding for the two tasks, a design first introduced in Janus and retained in Janus-Pro.
“Janus-Pro incorporates improvements across three dimensions: training strategies, data, and model size,” the report states, adding that the model demonstrates scalability with two configurations—1B and 7B parameters.
- The original Janus model, validated at the 1B parameter scale, faced limitations due to its relatively small model capacity and limited training data.
- These constraints led to suboptimal performance in short-prompt image generation and unstable text-to-image outputs. Janus-Pro-7B addresses these issues through enhanced training strategies, expanded datasets, and increased model capacity.
The report emphasizes that these upgrades not only improve the model’s text-to-image instruction-following performance but also enhance its stability and scalability, making it a strong contender in the AI-generated imagery space.
More insights
DeepSeek is a relatively new entrant to the AI industry, founded in 2023 by Chinese entrepreneur Liang Wenfeng. Despite being less than two years old, the company has already made a significant impact.
- In January 2025, after it released its open-source models for download, its AI assistant app quickly gained popularity, topping the iPhone app download charts and surpassing OpenAI’s ChatGPT app.
- DeepSeek’s latest reasoning model, R1, has drawn comparisons to top-tier AI products from OpenAI and Meta. The company claims its models are not only competitive in performance but also more efficient and cost-effective to develop.
- One of the standout claims from DeepSeek is the cost of training its models. The company stated that training one of its latest models cost $5.6 million, a figure far below the $100 million to $1 billion estimated by industry leaders for similar projects.
DeepSeek has also highlighted the efficiency of its development process, which reportedly does not rely on the most powerful AI accelerators.