Review of text-to-speech models for reading research papers

By Joe Golden and Chandradeep Chowdhury · April 21, 2025

Listen to this blog post:

We created an audio version of this blog post using Paper2Audio, which includes a summary of the main table. We manually spliced in the audio outputs from the models tested below. Paper2Audio isn't optimized for web content, but worked well on this blog post because it has a similar enough structure to a typical research paper.

Introduction
Key takeaway
Results summary
Evaluation methodology
Model reviews
Conclusion and recommendations

Introduction

Welcome to our evaluation of text-to-speech (TTS) models, tailored to a specific, demanding use case: narrating research papers and other technical documents. Your own needs might differ, and we found that each TTS model has distinct strengths and weaknesses, so there's likely no single "best" model for every scenario.

We approach this review much like those found in the world of PC hardware, a domain we're both passionate about. We've consumed countless detailed hardware reviews featuring extensive evaluations and complex benchmarks, and those reviews excel at helping users match hardware to specific requirements. Similarly, we aim to provide a rigorous, use-case-focused evaluation for TTS models.

Our primary evaluation criteria focus on:

Pronunciation accuracy: This is crucial, especially for technical terms, math, and acronyms frequently encountered in research papers.
Cost: We require a solution that is cost-effective for processing large volumes of text.
Customization: The ability to manually correct pronunciation errors is a significant advantage for handling specialized or novel terminology.

We also assessed several secondary factors:

Voice quality: Voices should sound natural, clear, and engaging enough for long listening sessions. We aimed for voices we subjectively scored 3-5 out of 10 in emotiveness – overly flat voices become monotonous, while overly expressive ones can be fatiguing when listening to research.
Voice variety: We need at least three distinct, high-quality American English voices: one for the main paper text, one for generated summaries, and one for additional generated context.
Ease of evaluation: How straightforward is it to test the model with custom text via demos or APIs? Do the provider's claims hold up under scrutiny?

For our specific application of reading scientific documents, advanced TTS features like fine-grained emotion tags or voice cloning are not relevant.

Key takeaway

Accuracy vs. quality: While many TTS models boast high voice quality, most struggled with accurate pronunciation of technical terms, symbols, and numbers common in research papers. This focus on sounding good often makes for impressive demos but poor products for specialized content. This is particularly true for open-weight models, which often prioritize natural-sounding voices over correctness. Customization is often key to bridging this accuracy gap.
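To make "customization" concrete, here is a minimal sketch of one common approach: rewriting hard-to-pronounce tokens into spoken forms before the text reaches the TTS model. This is an assumption about how such fixes could work, not a description of the modifications we applied to any model in this review. The replacement table, the spoken spellings, and the function name normalize_for_tts are all hypothetical.

```python
import re

# Hypothetical replacement table. The spoken spellings are illustrative
# guesses, not the pronunciation fixes used in this review.
REPLACEMENTS = [
    (re.compile(r"\bLaTeX\b"), "lay-tech"),
    (re.compile(r"\bArXiv\b"), "archive"),
    (re.compile(r"\s*!=\s*"), " is not equal to "),
    (re.compile(r"\bhrs\b"), "hours"),
    (re.compile(r"\bmins\b"), "minutes"),
    (re.compile(r"\bgpus\b"), "GPUs"),
]


def normalize_for_tts(text: str) -> str:
    """Apply each replacement in order and return the rewritten text."""
    for pattern, spoken in REPLACEMENTS:
        text = pattern.sub(spoken, text)
    return text


if __name__ == "__main__":
    print(normalize_for_tts("We use ArXiv and LaTeX. Json != xml."))
```

A lookup table like this is easy to extend when a new term is mispronounced, but it only helps when the model reads the rewritten spelling correctly, so each entry still needs a listening check.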

Results summary

Here's a high-level overview of the models we evaluated:

Model	Cost (/1M chars)	Overall (/10)	Energy (/10)	Naturalness (/10)	Accuracy (/10)	Value (/10)
Kokoro 82M 1.0	Self-hosted (~$0.65)***	5	8	7	3	10
Kokoro 82M 1.0 (Modified)	Self-hosted (~$0.65)***	8.5	8	7	9	10
Sesame CSM 1B	Self-hosted (~$5)***	0	-	-	-	-
SparkTTS 0.5B	Self-hosted (~$2.5)***	2	2	3	2	7
Zonos 1.6B 0.1	Self-hosted (~$8)***	1	4	4	0.5	5.5
OpenAI TTS-1	$15	6	7	9	5	3
OpenAI TTS-1 HD	$30	5	7	9	5	1
OpenAI GPT-4o mini TTS	~$42**	5	7	9	8	0.5
Amazon Polly Generative	$30	4.5	7	6	5.5	0.5
Amazon Polly Neural	$16	5	6.5	4	5.5	2.5
Cartesia Sonic 2.0	~$38*	3	5.5	6	1	0

Notes on Costs & Scores:

* Cartesia Sonic 2.0: Costs vary by subscription tier ($37-$39/1M characters). We used the average ($38).
** GPT-4o mini TTS: Cost is based on OpenAI's estimated $0.030/minute rate. We calculated cost per 1M characters assuming an average speaking rate of 150 words/minute and 4.7 characters/word (approx. 700 characters/minute), resulting in ~$42/1M characters. We prefer per-character pricing for easier cost estimation, as offered by most providers (including OpenAI's previous models).
*** Self-hosted Costs: These are estimates for medium-volume usage (defined as 10M characters daily with non-overlapping requests, incurring GPU warm-up/cooldown penalties per request). At low volumes, overheads can increase costs. At very high volumes with continuous utilization, costs may decrease significantly below these estimates.
Kokoro Score: The base score reflects out-of-the-box performance. The higher scores (8.5 overall and 9 accuracy) reflect performance after applying custom pronunciation fixes.
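The per-minute to per-character conversion in the GPT-4o mini TTS note can be reproduced with a few lines of arithmetic. The sketch below uses only the figures stated in that note ($0.030/minute, 150 words/minute, 4.7 characters/word); the function name cost_per_million_chars is ours and purely illustrative.

```python
def cost_per_million_chars(
    rate_per_minute: float,
    words_per_minute: float = 150.0,
    chars_per_word: float = 4.7,
) -> float:
    """Convert a per-minute audio price into a price per 1M input characters."""
    chars_per_minute = words_per_minute * chars_per_word  # about 705
    minutes_per_million_chars = 1_000_000 / chars_per_minute
    return rate_per_minute * minutes_per_million_chars


if __name__ == "__main__":
    # With the figures from the note, this lands between $42 and $43.
    print(f"${cost_per_million_chars(0.030):.2f} per 1M characters")
```

The result is sensitive to the assumed speaking rate: a faster voice reads more characters per minute, which lowers the effective per-character cost.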

Score descriptions (out of 10):
Overall: A holistic rating considering all factors, heavily weighted towards accuracy, cost, and customizability for our use case. It is not a simple average or formula based on the other scores.
Energy: How energetic and engaging the voice sounds (higher is more energetic).
Naturalness: How human-like and free of robotic artifacts or unnatural sounds the voice is (higher is more natural).
Accuracy: How well the model pronounces technical terms, symbols, numbers, and handles formatting cues in our torture test (higher is more accurate).
Value: An assessment of the cost-effectiveness, balancing the price against the overall performance and features delivered (higher is better value).

Evaluation methodology

To rigorously assess the models, we developed a "torture test" paragraph containing expressions we found to be challenging for TTS models, particularly within the context of research papers.

The torture test

This input text probes pronunciation of technical terms, symbols, numbers, acronyms, and formatting cues:

There are hard to pronounce phrases, e.g. (i) We use ArXiv and LaTeX (ii) It cost $5.6 million (iii) Json != xml; also (iv) Example vector: (x_1,...,x_2) (v) We have some RECOMMENDATIONS (i.e. suggestions) and (6) During 2010-2018. Figure 2a: It took us 16 NVIDIA gpus, and 13.7 hrs 14 mins. Consider a set A, where a equals 2 times a.

Evaluation procedure

Our evaluation followed these steps for each model:

Voice selection: We primarily reviewed the female voice most prominently featured by the provider. If unsuitable (e.g., overly emotive), we chose an alternative female voice. We also conducted a limited review of one male and one additional female voice per model where available and seemingly appropriate.
Audio generation: We generated audio for the torture test text using the selected voices at standard speed (1x).
Listening tests: We listened to the generated audio at normal speed (1x), half speed (0.5x), and double speed (2x).
Issue enumeration: We meticulously documented inaccuracies, including:
- Minor and major mispronunciations.
- Incorrect handling of symbols (e.g., !=, ...), numbers (currency, ranges, decimals), Roman numerals ((i)), and acronyms (gpu).
- Timing issues (awkward pauses, unnatural pacing).
- Robotic artifacts or unnatural sounds.
- Consistency problems.
Subjective quality assessment: We rated the overall subjective quality and listenability of the voice.
Ease of evaluation: We assessed the simplicity of testing the model with custom text (e.g., web playgrounds, API access).
Pronunciation customization: We investigated whether and how users can fix or customize pronunciations.
Provider claims: We noted any discrepancies between provider claims (website, docs) and our findings.
Scoring: We assigned an overall score (0-10) based on all factors, with the heaviest weighting given to cost, pronunciation accuracy, and customization potential.
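For readers who want to run a similar evaluation, the issue enumeration step can be kept organized with a small record type and a tally. This is a sketch under our own assumptions, not the tooling we used: the Issue fields, the category names, and the placeholder model names below are hypothetical, and the example entries are not results from this review.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    model: str      # placeholder model name
    category: str   # e.g. "mispronunciation", "symbol", "timing", "artifact", "consistency"
    severity: str   # "minor" or "major"
    snippet: str    # the part of the torture test where the issue occurred


def tally(issues: list[Issue]) -> Counter:
    """Count issues per (model, severity) pair."""
    return Counter((issue.model, issue.severity) for issue in issues)


if __name__ == "__main__":
    # Made-up entries for illustration only.
    log = [
        Issue("model-a", "symbol", "major", "Json != xml"),
        Issue("model-a", "mispronunciation", "minor", "LaTeX"),
        Issue("model-b", "timing", "minor", "13.7 hrs 14 mins"),
    ]
    for (model, severity), count in sorted(tally(log).items()):
        print(model, severity, count)
```

Counts like these only support the accuracy score; the overall score remains a holistic judgment rather than a formula.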

Model selection

We chose models based on popularity, apparent relevance to technical narration, and insights from prior internal evaluations. This resulted in a mix of leading closed-source APIs and prominent open-weight models.

Models reviewed:

Kokoro 82M 1.0
Sesame CSM 1B
SparkAudio SparkTTS 0.5B
Zyphra Zonos 1.6B 0.1
OpenAI TTS-1 and TTS-1 HD
OpenAI GPT-4o mini TTS
Amazon Polly (Generative & Neural)
Cartesia Sonic 2.0

Model reviews

Kokoro 82M 1.0, heart voice

Score: 5 / 10 (Out-of-the-box), 8.5 / 10 (With modifications)
License: Apache 2.0 (Open-weight)
Weights: 82M parameters
Link: Kokoro model card
Playground: Kokoro Hugging Face space

Kokoro is an 82M-parameter open-weight model.

Audio sample (heart voice):
