- DeepSeek rolled out its image recognition mode to all users, officially entering the era of image-text interaction.
- Its highly efficient visual framework significantly reduces computing-power consumption, paving the way for the V4.1 model update slated for June.

China's leading artificial intelligence (AI) startup DeepSeek has rolled out its "image recognition mode" to all users, marking the company's official entry into the era of image-text interaction.
This technological breakthrough coincides with the company's push for a record-breaking fundraising round, in which it plans to raise up to 50 billion yuan ($7.35 billion), according to a report by The Information on Friday.
The potential deal would value the company at roughly $45 billion and is expected to accelerate its revenue generation and overall pace of commercialization.
The image recognition mode entered beta testing last month, and nearly all test accounts can now access the feature.
Its capabilities extend well beyond simple text extraction. Real-world tests show the model can not only accurately identify artifacts from the Qianlong reign of the Qing Dynasty, but also parse complex webpage screenshots and reverse-engineer interactive HTML code with a single click.
When faced with highly difficult spatial reasoning tasks or popular internet meme images, it also demonstrates solid logical reasoning and precise emotional perception.
This new feature is primarily based on a core multimodal framework called "Thinking with Visual Primitives," which significantly enhances the large model's spatial awareness of complex scenes.
The framework directly integrates visual elements, such as spatial locations, into the reasoning chain, solving the logical dilemmas that traditional multimodal models face in complex spatial layouts.
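DeepSeek has not published the internals of "Thinking with Visual Primitives"; the sketch below is a hypothetical illustration of the general idea described above, in which each step of a reasoning chain can cite spatial primitives (here, bounding boxes) rather than text alone. All names and structures are invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch only: all class and variable names are invented.
# The point is that reasoning steps carry references to spatial evidence,
# so spatial claims can be grounded in (and checked against) coordinates.

@dataclass
class BoxPrimitive:
    label: str
    x: int  # pixel coordinates of the box's top-left corner
    y: int
    w: int  # box width and height in pixels
    h: int

@dataclass
class ReasoningStep:
    text: str
    refs: list[BoxPrimitive] = field(default_factory=list)

vase = BoxPrimitive("vase", x=120, y=80, w=200, h=340)
mark = BoxPrimitive("reign mark", x=180, y=390, w=90, h=25)

chain = [
    ReasoningStep("Locate the main object and any inscriptions.", [vase, mark]),
    ReasoningStep("The mark sits below the vase body, so read it next.", [mark]),
]

# A step's spatial claim can be verified against the primitives it cites:
# the mark's top edge really does lie below the vase's top edge.
assert mark.y > vase.y
```

The design choice this illustrates is that spatial relations become checkable facts inside the chain of thought, rather than free-text assertions the model might contradict later.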
More importantly, this innovative architecture is highly economical with computing resources, consuming only about 90 tokens to process an 800x800-pixel image.
In contrast, other mainstream large models like GPT typically require about 870 to 1,100 tokens to process an image of the same resolution.
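The reduction factor implied by the figures above can be computed directly. This is illustrative arithmetic only: the token counts are those reported in the article, and the function name is invented.

```python
# Illustrative arithmetic only: ~90 tokens (DeepSeek's framework) versus a
# reported baseline range of roughly 870-1,100 tokens for an 800x800 image.

def visual_token_savings(ours: int, baseline_low: int, baseline_high: int) -> tuple[float, float]:
    """Return the (low, high) token-reduction factors versus a baseline range."""
    return (baseline_low / ours, baseline_high / ours)

low, high = visual_token_savings(90, 870, 1100)
print(f"Roughly {low:.1f}x to {high:.1f}x fewer visual tokens per 800x800 image")
# i.e. roughly an order-of-magnitude reduction in per-image token cost
```

At typical per-token inference prices, a roughly 10x reduction in visual tokens translates directly into a roughly 10x lower serving cost for image inputs.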
Despite the impressive technical performance, feedback from large-scale user testing indicates that the newly launched image recognition mode still has notable shortcomings.
Because its knowledge base stops at 2025, the model is prone to misidentifying the latest electronic products.
Furthermore, when dealing with counter-intuitive graphics such as optical illusions, prolonged deliberation can sometimes lead to logical hallucinations and uncertain answers.
The current mode is limited to pure visual understanding; it does not yet support broader multimodal features such as image generation or video analysis.
The expansion of its visual capabilities comes at a critical juncture, as DeepSeek accelerates commercialization and seeks to consolidate its industry-leading position.
To align its product release cadence more closely with current industry standards, the company has informed investors of its plan to officially launch the V4.1 model update in June, according to The Information.