ConvoBench

Voice Agent Paper Hunt

Discover Voice Agent Research

Curated collection of papers on Conversational AI, Voice Agents, Speech LLMs, and Real-time Voice Interaction

20 papers · 10 benchmarks · 14 papers from 2024 onward · 7,008 total citations
Landmark

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, et al. · ICML 2023 · 2,100 citations

Whisper is trained on 680,000 hours of multilingual and multitask supervised data, achieving robust speech recognition that generalizes well across domains and languages.

ASR · foundation-model · OpenAI
arXiv
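As a concrete illustration, here is a minimal transcription call with the open-source `openai-whisper` package; the model size and audio file name are illustrative choices, not from the paper:

```python
# Minimal Whisper transcription sketch using the open-source
# `openai-whisper` package (pip install openai-whisper).
# "base" and "meeting.wav" are illustrative, not from the paper.
import whisper

model = whisper.load_model("base")        # weights download on first use
result = model.transcribe("meeting.wav")  # language is auto-detected
print(result["text"])
```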
Landmark

Google Duplex: An AI System for Accomplishing Real-World Tasks Over the Phone

Yaniv Leviathan, Yossi Matias · Google AI Blog 2018 · 1,250 citations

Google Duplex uses a recurrent neural network to conduct natural-sounding conversations over the phone for tasks like making restaurant reservations.

conversational-AI · real-world · Google
Link

SUPERB: Speech Processing Universal PERformance Benchmark

Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, et al. · INTERSPEECH 2021 · 892 citations

A benchmark for evaluating speech processing capabilities across critical tasks like Automatic Speech Recognition, Keyword Spotting, Speaker Identification, Intent Classification, and Emotion Recognition.

benchmark · ASR · foundation-model
arXiv
Landmark

GPT-4o: Omni-Modal Foundation Model

OpenAI · OpenAI Blog 2024 · 890 citations

GPT-4o responds to audio input in as little as 232 ms, comparable to human response times in conversation, enabling natural real-time voice interaction with full-duplex capabilities.

multimodal · real-time · OpenAI
Link

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

Meta AI · arXiv 2023 · 456 citations

A foundational multilingual and multitask model that supports nearly 100 languages for speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation.

multilingual · multimodal · translation
arXiv

MT-Bench: Multi-Turn Benchmark for LLM Conversation

Various Authors · arXiv 2023 · 423 citations

MT-Bench assesses LLMs in multi-turn dialogues, focusing on their capacity to maintain context and demonstrate reasoning skills across eight categories.

benchmark · multi-turn · LLM
arXiv
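MT-Bench answers are typically graded by a strong LLM acting as judge. A hedged sketch of that pattern with the `openai` Python client follows; the prompt wording and judge model name are illustrative assumptions, not MT-Bench's exact setup:

```python
# Sketch of MT-Bench-style LLM-as-judge grading. The prompt text and
# model name are illustrative assumptions, not MT-Bench's exact judge.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, answer: str) -> str:
    prompt = (
        "Rate the assistant's answer from 1 to 10 for helpfulness, "
        "relevance, and accuracy. Reply with the number only.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # any strong judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(judge("What causes tides?", "Mostly the Moon's gravitational pull."))
```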

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

LMSYS · arXiv 2024 · 234 citations

Chatbot Arena offers an open environment for evaluating LLMs based on human preferences through pairwise comparisons.

evaluation · human-preference · LLM
arXiv
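Pairwise votes like Arena's are commonly aggregated into a leaderboard with Elo-style ratings. A self-contained sketch of the update rule; the model names, battle outcomes, and K constant below are made up for illustration:

```python
# Online Elo updates from pairwise human preferences, the rating scheme
# Chatbot Arena popularized for LLM leaderboards. All data here is fake.
from collections import defaultdict

K = 32  # update step size; the leaderboard's exact constant may differ

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser] -= K * (1 - e)

ratings = defaultdict(lambda: 1000.0)
battles = [("model-a", "model-b"), ("model-b", "model-c"), ("model-a", "model-c")]
for winner, loser in battles:
    update(ratings, winner, loser)
print(dict(ratings))
```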

Moshi: A Full-Duplex Speech-to-Speech Model

Kyutai Labs · arXiv 2024 · 156 citations

Moshi enables simultaneous listening and speaking (full-duplex), processing speech directly without text intermediaries, achieving natural turn-taking.

full-duplex · speech-to-speech · real-time
arXiv

SLUE: Spoken Language Understanding Evaluation

Shang-Wen Li, Suwon Shon, Hao Tang, et al. · ASRU 2021 · 156 citations

A benchmark suite covering tasks like Named Entity Recognition, Sentiment Analysis, and Automatic Speech Recognition for advancing conversational AI.

benchmark · SLU · NER
arXiv

LLaMA-Omni: Seamless Speech Interaction with LLMs

Various Authors · arXiv 2024 · 89 citations

LLaMA-Omni is built on LLaMA-3.1-8B for low-latency, high-quality speech interaction, generating speech responses directly from speech instructions.

speech-to-speech · LLM · real-time
arXiv

Survey on Recent Advances in Speech Language Models

Various Authors · arXiv 2024 · 89 citations

A comprehensive survey reviewing methodologies, architectural components, training approaches, and evaluation metrics for Speech Language Models.

survey · SpeechLM · methodology
arXiv

DialogBench: Evaluating LLMs as Human-like Dialogue Systems

Various Authors · NAACL 2024 · 67 citations

DialogBench evaluates whether LLMs can act as human-like dialogue systems, comprising 12 distinct dialogue tasks with GPT-4-generated evaluation instances.

benchmark · dialogue · LLM
arXiv

VoiceAssistant-Eval: A Comprehensive Benchmark for AI Assistants

Various Authors · arXiv 2024 · 45 citations

A comprehensive benchmark of 10,497 curated examples spanning 13 task categories, including natural sounds, music, spoken dialogue, multi-turn dialogue, and role-play imitation.

benchmark · evaluation · multimodal
arXiv

Sparrow-1: Multilingual Audio Model for Real-Time Conversational Flow

Various Authors · arXiv 2024 · 34 citations

Sparrow-1 focuses on real-time conversational flow and 'floor transfer,' predicting when a system should listen, wait, or speak to mimic human conversation timing.

real-time · turn-taking · multilingual
arXiv
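For contrast with a learned floor-transfer predictor, the naive baseline most voice pipelines start from is a fixed silence threshold. A toy sketch of that baseline follows; the 700 ms threshold and function names are illustrative, not from the paper:

```python
# Naive end-of-turn baseline: speak only after a fixed stretch of user
# silence. Learned models like Sparrow-1 replace this threshold with a
# prediction of when the conversational floor is actually handed over.
def should_speak(ms_since_user_audio: float, user_is_speaking: bool,
                 silence_threshold_ms: float = 700.0) -> bool:
    if user_is_speaking:
        return False  # half-duplex: never talk over the user
    return ms_since_user_audio >= silence_threshold_ms

# e.g. after 800 ms of user silence, the agent may take the floor
assert should_speak(800.0, user_is_speaking=False)
```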

VocalBench: Benchmarking Vocal Conversational Abilities

Various Authors · arXiv 2024 · 31 citations

A benchmark designed to assess speech conversational abilities using 9,400 instances across semantic quality, acoustic performance, conversational abilities, and robustness.

benchmark · conversation · speech
arXiv

MiniMax Speech 2.5: Sub-250ms End-to-End Voice AI

MiniMax · MiniMax Blog 2024 · 28 citations

MiniMax Speech 2.5 achieves end-to-end latency under 250 milliseconds, enabling real-time voice interaction.

latency · real-time · TTS
Link

SOVA-Bench: Evaluating Generative Speech LLMs and Voice Assistants

Various Authors · arXiv 2024 · 23 citations

An evaluation system for generative speech LLMs that quantifies both general knowledge and the ability to recognize, understand, and generate speech.

benchmark · LLM · voice-assistant
arXiv

VoiceAgentEval: Evaluating LLMs for Expert-Level Outbound Calling

Various Authors · arXiv 2024 · 18 citations

A benchmark for evaluating LLMs in expert-level intelligent outbound calling scenarios with user simulation and dynamic evaluation methods.

benchmark · voice-agent · LLM
arXiv

SpeechR: Benchmarking Speech Reasoning in Large Audio-Language Models

Various Authors · arXiv 2024 · 15 citations

A benchmark to evaluate speech reasoning capabilities of large audio-language models in factual, procedural, and normative tasks.

reasoning · audio-LM · benchmark
arXiv

WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild

Various Authors · arXiv 2025 · 12 citations

A benchmark for end-to-end SpeechLLMs that addresses limitations of existing evaluations and provides comprehensive assessment in real-world speech interactions.

benchmark · SpeechLLM · real-world
arXiv