FunAudioChat
Open Source | 8B Parameters | #1 Benchmarks

Fun Audio Chat
Real-time Speech-to-Speech AI

Experience natural voice conversations with end-to-end speech AI

FunAudioChat is Alibaba's breakthrough model that understands emotions, executes voice commands, and responds naturally, all without a traditional ASR + LLM + TTS pipeline.

100% Open Source
#1 on Benchmarks
Emotion Aware
Voice Commands

See FunAudioChat in Action

Watch how our end-to-end speech AI enables natural voice conversations

Why FunAudioChat?

Next-generation voice AI with breakthrough capabilities

End-to-End Architecture

Direct speech-to-speech processing without an ASR + LLM + TTS pipeline. Lower latency, more natural conversations.

Dual-Resolution Design

5 Hz semantic processing + 25 Hz speech generation, cutting GPU compute by roughly half while maintaining quality.

Emotion Recognition

Understands emotions from tone, pace, and pauses. Responds with appropriate emotional context.

Speech Function Call

Execute tool calls and commands directly through voice. No text intermediary needed.

Fully Open Source

Complete model weights and code on ModelScope and Hugging Face. Deploy anywhere.

Top Benchmark Results

#1 on OpenAudioBench, MMAU, Speech-ACEBench, VStyle in same-size category.
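The dual-resolution savings described above can be illustrated with back-of-envelope arithmetic. In this sketch the 5 Hz / 25 Hz rates come from this page, but the 10-second utterance length and the "everything at 25 Hz" baseline are assumptions made purely for illustration; the overall ~50% GPU figure depends on how compute actually splits between understanding and synthesis.

```python
# Back-of-envelope token budget for the dual-resolution design.
# The 5 Hz / 25 Hz rates come from the page; the 10 s utterance and
# the single-rate baseline are assumptions for illustration only.
SEMANTIC_HZ = 5     # semantic (understanding) token rate
SPEECH_HZ = 25      # speech-generation token rate
DURATION_S = 10     # assumed utterance length

semantic_tokens = SEMANTIC_HZ * DURATION_S   # 50 tokens to understand
speech_tokens = SPEECH_HZ * DURATION_S       # 250 tokens to synthesize

# A hypothetical single-resolution model would also run the
# understanding stage at the full 25 Hz rate:
baseline_semantic = SPEECH_HZ * DURATION_S   # 250 tokens
saved = 1 - semantic_tokens / baseline_semantic
print(f"understanding tokens cut by {saved:.0%}")  # -> 80%
```

Fewer semantic tokens means fewer forward passes through the language-model backbone per second of audio, which is where the compute savings come from.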

Benchmark Results

FunAudioChat achieves #1 ranking on multiple benchmarks including OpenAudioBench, MMAU, Speech-ACEBench, and VStyle in the same-size model category (around 8B parameters).

FunAudioChat Benchmark Results - Chart 1
FunAudioChat Benchmark Results - Chart 2
#1 OpenAudioBench
#1 MMAU
#1 Speech-ACEBench
#1 VStyle
Demo

Hear the Difference

Listen to real examples of FunAudioChat's capabilities

Voice Empathy

Voice Empathy Example

AI understands emotions and responds empathetically

Voice Instructions

Voice Instruction Following

Adjusts speaking style based on instructions

Function Calling

Voice Function Calling

Execute tool calls through voice commands

Audio Understanding

Understand and analyze audio content

How It Works

End-to-end speech processing in three steps

1

Speech Input

Speak naturally - the model captures audio with full emotional context

2

Dual-Resolution Processing

5 Hz semantic understanding + 25 Hz speech synthesis in a single model

3

Natural Response

Receive emotionally appropriate speech output in real time
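The three steps above can be sketched as a simple loop. Everything below is a stand-in for illustration: these function names are not the real FunAudioChat API, and the stubs just pass bytes through.

```python
# Purely illustrative sketch of the three-step loop.
# None of these functions are the real FunAudioChat API; they are
# stubs showing where each step would plug in.

def capture_audio() -> bytes:
    """Step 1: capture raw speech (stubbed as 1 s of silence)."""
    return b"\x00" * 16000  # assumed 16 kHz, 8-bit mono, for illustration

def dual_resolution_model(audio: bytes) -> bytes:
    """Step 2: 5 Hz understanding + 25 Hz synthesis in one model (stub)."""
    return audio  # a real model would return newly generated speech

def play(audio: bytes) -> int:
    """Step 3: stream the response back to the user (stub)."""
    return len(audio)

played = play(dual_resolution_model(capture_audio()))
print(played)  # -> 16000
```

The point of the sketch is structural: there is a single model between input and output, rather than three separately deployed ASR, LLM, and TTS services.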

Frequently Asked Questions

Everything you need to know about FunAudioChat

What is FunAudioChat?
FunAudioChat (Fun-Audio-Chat-8B) is an open-source end-to-end speech-to-speech AI model developed by Alibaba's Tongyi Bailing team. It can understand and respond to voice input directly without needing separate ASR, LLM, and TTS components.

How is it different from traditional voice assistants?
Traditional systems use a pipeline of ASR (speech-to-text) → LLM (text processing) → TTS (text-to-speech). FunAudioChat processes speech end-to-end in a single model, resulting in lower latency, better emotion preservation, and more natural conversations.

What hardware does it need?
FunAudioChat-8B requires a GPU with at least 16GB VRAM for inference. For optimal performance, we recommend 24GB+ VRAM. It supports NVIDIA GPUs with CUDA 11.8+.

Can I self-host it?
Yes! FunAudioChat is fully open source. You can download the model weights from ModelScope or Hugging Face and deploy them on your own infrastructure.

Can I use it commercially?
Please refer to the license on the official repository. The model is open source, but commercial use terms may vary. Check the LICENSE file in the GitHub repository for details.

Which languages does it support?
FunAudioChat primarily supports Chinese and English. The model has been trained on multilingual data and can handle both languages fluently.
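To make the latency argument behind the end-to-end design concrete, here is a toy comparison. Every number below is invented for illustration; none of them are measurements of FunAudioChat or of any real ASR/LLM/TTS stack.

```python
# Toy latency comparison; all numbers are invented for illustration
# and are NOT measurements of FunAudioChat or any real system.
asr_ms = 300   # speech-to-text stage (assumed)
llm_ms = 500   # text reasoning stage (assumed)
tts_ms = 200   # text-to-speech stage (assumed)

# In a cascaded pipeline the stages run one after another, so their
# latencies add up before the user hears the first sound:
cascaded_ms = asr_ms + llm_ms + tts_ms
print(cascaded_ms)  # -> 1000
```

An end-to-end model replaces the three sequential stages with one forward pass and can start streaming audio as soon as the first speech tokens are generated, instead of waiting for a complete text transcript and a complete text response.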

Ready to Build with FunAudioChat?

Get started in minutes with our open-source model

Fun Audio Chat - Real-time Voice AI with Emotion Recognition