StepFun introduces StepAudio 2.5 TTS, targeting booming AI content market

  • StepAudio 2.5 TTS is designed to let ordinary users easily become voice directors using natural language.
  • The technology roll-out comes as the company weighs a Hong Kong IPO that could raise up to $500 million.

Chinese AI (artificial intelligence) startup StepFun has released its next-generation voice generation model, StepAudio 2.5 TTS, which aims to let ordinary users act as voice directors through natural language and to lower the barrier to professional audio production.

The model features three core capabilities: global context control, inline context control, and zero-shot voice cloning with full timbre control, according to a statement on Thursday.

This advancement means the AI no longer just reads input text mechanically but can perform it with emotion, much like a professional voice actor.

Instead of relying on traditional preset tags or phrase combinations, users can describe in natural language the emotion, rhythm, and even the inner psychological state they want the synthesized voice to convey.

The model is now fully available on StepFun's open platform, providing developers and content creators with scenario-based voice solutions, the statement said.

The launch comes at a crucial time for the Shanghai-based AI unicorn as it prepares to tap the capital markets.

The company is considering an initial public offering in Hong Kong as soon as this year, which could raise up to about $500 million.

As China pushes to develop its domestic AI industry to compete globally, the company completed a new funding round of over 5 billion yuan ($733 million) in late January.

Chinese AI rivals MiniMax and Zhipu AI debuted on the Hong Kong stock exchange in January, setting new local fundraising records.

Apr 15, 2026

($1 = 6.8251 yuan)
