Review of text-to-speech models for reading research papers

By Joe Golden and Chandradeep Chowdhury · April 21, 2025

Listen to this blog post:

Download

We created an audio version of this blog post using Paper2Audio, which includes a summary of the main table. We manually spliced in the audio outputs from the models tested below. Paper2Audio isn't optimized for web content, but it worked well on this blog post because the post's structure is similar enough to that of a typical research paper.

Introduction

Welcome to our evaluation of text-to-speech (TTS) models, tailored for a specific, demanding use case: narrating research papers and other technical documents. While your own needs might differ, we found that TTS models possess distinct strengths and weaknesses. As such, there's likely no single "best" model for every scenario.

We approach this review much like those found in the world of PC hardware – a domain we're both passionate about. As computer hardware enthusiasts, we've consumed countless detailed reviews featuring extensive evaluations and complex benchmarks. Those reviews excel at helping users match hardware to specific requirements. Similarly, we aim to provide a rigorous, use-case-focused evaluation for TTS models.

Our primary evaluation criteria focus on:

  1. Pronunciation accuracy: This is crucial, especially for technical terms, math, and acronyms frequently encountered in research papers.

  2. Cost: We require a solution that is cost-effective for processing large volumes of text.

  3. Customization: The ability to manually correct pronunciation errors is a significant advantage for handling specialized or novel terminology.

We also assessed several secondary factors:

  • Voice quality: Voices should sound natural, clear, and engaging enough for long listening sessions. We aimed for voices we subjectively scored 3-5 out of 10 in emotiveness – overly flat voices become monotonous, while overly expressive ones can be fatiguing when listening to research.

  • Voice variety: We need at least three distinct, high-quality American English voices: one for the main paper text, one for generated summaries, and one for additional generated context.

  • Ease of evaluation: How straightforward is it to test the model with custom text via demos or APIs? Do the provider's claims hold up under scrutiny?

For our specific application of reading scientific documents, advanced TTS features like fine-grained emotion tags or voice cloning are not relevant.

Key takeaway

Accuracy vs. quality: While many TTS models boast high voice quality, most struggled with accurate pronunciation of technical terms, symbols, and numbers common in research papers. This focus on sounding good often makes for impressive demos but poor products for specialized content. This is particularly true for open-weight models, which often prioritize natural-sounding voices over correctness. Customization is often key to bridging this accuracy gap.

Results summary

Here's a high-level overview of the models we evaluated:

| Model | Cost (/1M chars) | Overall (/10) | Energy (/10) | Naturalness (/10) | Accuracy (/10) | Value (/10) |
|---|---|---|---|---|---|---|
| Kokoro 82M 1.0 | Self-hosted (~$0.65)*** | 5 | 8 | 7 | 3 | 10 |
| Kokoro 82M 1.0 (Modified) | Self-hosted (~$0.65)*** | 8.5 | 8 | 7 | 9 | 10 |
| Sesame CSM 1B | Self-hosted (~$5)*** | 0 | - | - | - | - |
| SparkTTS 0.5B | Self-hosted (~$2.5)*** | 2 | 2 | 3 | 2 | 7 |
| Zonos 1.6B 0.1 | Self-hosted (~$8)*** | 1 | 4 | 4 | 0.5 | 5.5 |
| OpenAI TTS-1 | $15 | 6 | 7 | 9 | 5 | 3 |
| OpenAI TTS-1 HD | $30 | 5 | 7 | 9 | 5 | 1 |
| OpenAI GPT-4o mini TTS | ~$42** | 5 | 7 | 9 | 8 | 0.5 |
| Amazon Polly Generative | $30 | 4.5 | 7 | 6 | 5.5 | 0.5 |
| Amazon Polly Neural | $16 | 5 | 6.5 | 4 | 5.5 | 2.5 |
| Cartesia Sonic 2.0 | ~$38* | 3 | 5.5 | 6 | 1 | 0 |

Notes on Costs & Scores:

  • * Cartesia Sonic 2.0: Costs vary by subscription tier ($37-$39/1M characters). We used the average ($38).

  • ** GPT-4o mini TTS: Cost is based on OpenAI's estimated $0.030/minute rate. We calculated cost per 1M characters assuming an average speaking rate of 150 words/minute and 4.7 characters/word (approx. 700 characters/minute), resulting in ~$42/1M characters. We prefer per-character pricing for easier cost estimation, as offered by most providers (including OpenAI's previous models).

  • *** Self-hosted Costs: These are estimates for medium-volume usage (defined as 10M characters daily with non-overlapping requests, incurring GPU warm-up/cooldown penalties per request). At low volumes, overheads can increase costs. At very high volumes with continuous utilization, costs may decrease significantly below these estimates.

  • Kokoro Score: The base score reflects out-of-the-box performance. The higher scores (8.5 overall and 9 accuracy) reflect performance after applying custom pronunciation fixes.

  • Score descriptions (out of 10):

    • Overall: A holistic rating considering all factors, heavily weighted towards accuracy, cost, and customizability for our use case. It is not a simple average or formula based on the other scores.

    • Energy: How energetic and engaging the voice sounds (higher is more energetic).

    • Naturalness: How human-like and free of robotic artifacts or unnatural sounds the voice is (higher is more natural).

    • Accuracy: How well the model pronounces technical terms, symbols, numbers, and handles formatting cues in our torture test (higher is more accurate).

    • Value: An assessment of the cost-effectiveness, balancing the price against the overall performance and features delivered (higher is better value).

Evaluation methodology

To rigorously assess the models, we developed a "torture test" paragraph containing expressions we found to be challenging for TTS models, particularly within the context of research papers.

The torture test

This input text probes pronunciation of technical terms, symbols, numbers, acronyms, and formatting cues:

There are hard to pronounce phrases, e.g. (i) We use ArXiv and LaTeX (ii) It cost $5.6 million (iii) Json != xml; also (iv) Example vector: (x_1,...,x_2) (v) We have some RECOMMENDATIONS (i.e. suggestions) and (6) During 2010-2018. Figure 2a: It took us 16 NVIDIA gpus, and 13.7 hrs 14 mins. Consider a set A, where a equals 2 times a.

Evaluation procedure

Our evaluation followed these steps for each model:

  1. Voice selection: We primarily reviewed the female voice most prominently featured by the provider. If unsuitable (e.g., overly emotive), we chose an alternative female voice. We also conducted a limited review of one male and one additional female voice per model where available and seemingly appropriate.

  2. Audio generation: We generated audio for the torture test text using the selected voices at standard speed (1x).

  3. Listening tests: We listened to the generated audio at normal speed (1x), half speed (0.5x), and double speed (2x).

  4. Issue enumeration: We meticulously documented inaccuracies, including:

    • Minor and major mispronunciations.

    • Incorrect handling of symbols (e.g., !=, ...), numbers (currency, ranges, decimals), Roman numerals ((i)), and acronyms (gpu).

    • Timing issues (awkward pauses, unnatural pacing).

    • Robotic artifacts or unnatural sounds.

    • Consistency problems.

  5. Subjective quality assessment: We rated the overall subjective quality and listenability of the voice.

  6. Ease of evaluation: We assessed the simplicity of testing the model with custom text (e.g., web playgrounds, API access).

  7. Pronunciation customization: We investigated whether and how users can fix or customize pronunciations.

  8. Provider claims: We noted any discrepancies between provider claims (website, docs) and our findings.

  9. Scoring: We assigned an overall score (0-10) based on all factors, with the heaviest weighting given to cost, pronunciation accuracy, and customization potential.

Model selection

We chose models based on popularity, apparent relevance to technical narration, and insights from prior internal evaluations. This resulted in a mix of leading closed-source APIs and prominent open-weight models.

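Models reviewed:

  • Kokoro 82M 1.0

  • Sesame CSM 1B

  • SparkAudio SparkTTS 0.5B

  • Zyphra Zonos 1.6B 0.1

  • OpenAI TTS-1 and TTS-1 HD

  • OpenAI GPT-4o mini TTS

  • Amazon Polly (Generative & Neural)

  • Cartesia Sonic 2.0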

Model reviews


Kokoro 82M 1.0, heart voice

Score: 5 / 10 (Out-of-the-box), 8.5 / 10 (With modifications)
License: Apache 2.0 (Open-weight)
Weights: 82M parameters
Link: Kokoro model card
Playground: Kokoro Hugging Face space

Kokoro is an 82M parameter open-weight model.

Audio sample (heart voice):

Download

Accuracy issues (heart voice, out-of-the-box):

  • Roman numerals (ii), (iii), (iv) pronounced literally ("Roman 2", "Roman 3", "Roman 4"). (v) pronounced as the letter "v".

  • ArXiv mispronounced ("Arziv").

  • != read as "equals" (should be "not equals").

  • (x_1,...,x_2) read as "x 1 x 2" (lacks subscript/range indication).

  • 2018 in 2010-2018 read as "2 0 1 8".

  • gpus mispronounced ("g-pas").

  • hrs read as letters ("h-r-s").

  • a skipped in "where a equals".

Overall evaluation:

Kokoro's default Heart voice is pleasant, very clear, and possesses a moderate energy level well-suited for research paper narration. It avoids robotic tones and generally has natural pacing, performing well at 0.5x and 2x speed.

However, its out-of-the-box accuracy suffers from significant issues, particularly with Roman numerals, the inequality symbol, and the skipped 'a', making it problematic for technical content without intervention.

Kokoro's major strength lies in its customizability and efficiency. As a small 82M parameter model, it's extremely cost-effective to self-host, even potentially running on local GPUs or CPUs. It utilizes a partially rule-based grapheme-to-phoneme (G2P) engine (via its sister project Misaki), allowing users to define custom pronunciations. We found this process relatively straightforward and were able to correct nearly all the identified accuracy issues. This customization dramatically improved its suitability for our use case, justifying the significantly higher "modified" score of 8.5/10.
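To illustrate the general idea of pronunciation fixes, here is a minimal sketch written as a plain text pre-processing pass that runs before the text reaches a TTS model. It does not use Misaki's API, and the rule tables, the respellings, and the `normalize_for_tts` name are all hypothetical; treat it as one possible shape for such fixes rather than a record of the exact corrections behind the modified score.

```python
import re

# Hypothetical respellings for terms from the torture test; tune these by ear.
WORD_FIXES = {
    "ArXiv": "archive",
    "LaTeX": "lah-tek",
    "gpus": "G P Us",
    "hrs": "hours",
    "mins": "minutes",
}

# Hypothetical spoken forms for parenthesized Roman-numeral list markers.
ROMAN_MARKERS = {"i": "one", "ii": "two", "iii": "three", "iv": "four", "v": "five"}

_WORD_RE = re.compile(r"\b(" + "|".join(map(re.escape, WORD_FIXES)) + r")\b")
_ROMAN_RE = re.compile(r"\((i{1,3}|iv|v)\)")
_YEAR_RANGE_RE = re.compile(r"\b(\d{4})-(\d{4})\b")


def normalize_for_tts(text: str) -> str:
    """Rewrite hard-to-pronounce tokens into plain words before synthesis."""
    # "(ii)" -> "two,". "(i.e. ...)" is left alone because the numeral is not
    # immediately followed by a closing parenthesis.
    text = _ROMAN_RE.sub(lambda m: ROMAN_MARKERS[m.group(1)] + ",", text)
    # "Json != xml" -> "Json not equals xml"
    text = re.sub(r"\s*!=\s*", " not equals ", text)
    # "2010-2018" -> "2010 to 2018"
    text = _YEAR_RANGE_RE.sub(r"\1 to \2", text)
    # Whole-word respellings such as "hrs" -> "hours"
    return _WORD_RE.sub(lambda m: WORD_FIXES[m.group(1)], text)


print(normalize_for_tts("It took us 16 NVIDIA gpus, and 13.7 hrs 14 mins."))
```

A rule table like this is easy to extend as new problem terms turn up, which is the main appeal of a customizable pipeline over a closed API.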

Audio sample (heart voice, modified with custom pronunciations):

Download

Evaluation was easy via the Hugging Face space demo, though we experienced occasional reliability issues (errors, slow generation) with every demo hosted on Hugging Face Spaces, not just Kokoro's. We found no misleading claims in Kokoro's documentation.

Evaluation of other voices (bella, echo): Bella (female) offers a similarly pleasant and clear voice with moderate energy, distinct from Heart. Echo (male) provides comparable quality and energy. All voices share the same underlying accuracy issues because they share a G2P engine, but those issues are correctable in the same way. We slightly prefer Heart, but Bella and Echo are solid alternatives.

(Audio samples for bella and echo)

Download
Download

Sesame CSM 1B, female voice

Score: 0 / 10
License: Apache 2.0 (Open-weight)
Weights: 1B parameters
Links: Sesame CSM announcement, CSM 1B model card
Playgrounds: Replicate, fal.ai

CSM (Conversational Speech Model) is a 1B parameter open-weight model from Sesame, using a Llama backbone and an audio decoder.

Audio sample (female voice, speaker 0):

Download

Overall evaluation:

The default female voice (speaker 0) exhibited severe slurring, resulting in poor pronunciation accuracy. After just a few seconds, the output became completely unintelligible. Listening at different speeds (0.5x, 2.0x) only exacerbated the problem.

We tested across multiple community-hosted playgrounds (Replicate, fal.ai) to rule out implementation issues, obtaining similarly poor results. This is concerning, as the demos in Sesame's announcement post sound significantly better. If this discrepancy stems from improper inference setups, we urge Sesame to provide an official playground or improve setup guidance for community hosts.

While the model's open nature and moderate size (1B parameters is large for TTS, but manageable) allow for self-hosting at a moderate cost, the fundamental quality issues render it unusable for our purposes in its current tested state, hence the zero score. Evaluation via community playgrounds was easy and reliable (more so than Hugging Face Spaces in our tests).

Evaluation of male voice (speaker 1): The male voice (speaker 1) initially sounded slightly better but also degraded similarly after a few seconds. The limited voice selection is also a drawback.

Audio sample (male voice, speaker 1):

Download

SparkAudio SparkTTS 0.5B, female voice

Score: 2 / 10
License: Apache 2.0 (Open-weight)
Weights: 0.5B (500M) parameters
Link: Spark model card
Playground: Spark Hugging Face space

SparkTTS is a 0.5B parameter open-weight model built on Qwen2.5, featuring zero-shot voice creation and cloning.

Audio sample (female voice, default):

Download

Accuracy issues (female voice, default settings, voice creation mode):

  • ArXiv mispronounced ("arsiv").

  • $5.6 million read incorrectly ("five dollars six million").

  • Json != xml sounded unintelligible.

  • (x_1,...,x_2) skipped the first term ("x_1").

  • gpus mispronounced ("g-pus").

  • 13.7 hrs sounded unintelligible.

  • mins pronounced literally ("mins," not "minutes").

Overall evaluation:

The default female voice lacked energy, making it unsuitable for engaging narration. While the model offers voice creation (adjusting pitch/speed) and cloning features (which we didn't test extensively), the baseline quality was low. Listening at 0.5x worsened the low-energy feel, while 2.0x offered slight improvement.

More critically, SparkTTS exhibited moderate accuracy issues, with several key phrases becoming completely unintelligible (Json != xml, 13.7 hrs). This level of inaccuracy makes it unsuitable for reliable technical content narration.

As a smaller 500M parameter model, SparkTTS is inexpensive to self-host and potentially usable on local GPUs. Its voice creation/cloning features might appeal to hobbyists. However, its core performance on technical text is currently unacceptable for our needs. Evaluation via the official Hugging Face space was straightforward, and we found no misleading claims.

Evaluation of male voice (default): The default male voice was more energetic and avoided being completely unintelligible. However, it suffered from a similar number of accuracy issues, though different ones (e.g., it read $5.6 million correctly but pronounced 2018 as twenty-eight). Although there are only two default voices, we did not dock points for the limited voice selection, because the voice creation features allow further customization.

Audio sample (male voice, default):

Download

Zyphra Zonos 1.6B 0.1

Score: 1 / 10
License: Apache 2.0 (Open-weight)
Weights: 1.6B parameters
Links: Zonos release post, Transformers model card, Hybrid model card
Playgrounds: ZonosTTS, Replicate, Official Playground

Zonos 0.1 is a 1.6B parameter open-weight model focused on high-fidelity voice cloning, offered in "transformer" and "hybrid" (state-space model + transformer) versions.

Audio sample (female voice, transformers):

Download

Accuracy issues (female voice, transformers, default settings):

  • (i) read as "I".

  • ArXiv mispronounced ("Arziv").

  • LaTeX severely mispronounced ("Lettuce").

  • (ii) It cost $5.6 million read nonsensically ("Roman e six cents million").

  • != read as "equals".

  • (iv) skipped entirely.

  • (v) read as letters ("v i").

  • gpus mispronounced ("g-po-es").

  • hrs, mins read as letters ("H-R-S", "M-I-N-S").

  • a equals 2 times a: The final "a" was skipped (potentially due to the playground's 30-second generation limit).

Overall evaluation:

The default female voice (transformers version) is reasonably clear but includes subtle, distracting breathing artifacts. It could benefit from more energy. While voice cloning (Zonos's main feature) might improve subjective quality, it cannot fix the core problems. Listening at 0.5x/2.0x didn't introduce new issues but didn't help the base quality either.

Zonos suffers from severe accuracy issues when reading technical terms, symbols, and numbers, rendering it unsuitable for our research paper narration use case. The available voice cloning customization does not address these fundamental pronunciation flaws.

At 1.6B parameters, Zonos is relatively large for TTS, requiring mid-range to high-end GPUs for local use and incurring higher cloud hosting costs compared to sub-1B models. While advertised for voice cloning (and cloned examples sound better), the poor baseline performance is a major drawback.

Evaluation was easy thanks to multiple available official and unofficial playgrounds. We found no misleading claims. The release announcement includes voice cloning examples that sound much better than the default voice; however, we did not feel misled, as the post clearly emphasizes that this model is designed for voice cloning.

Evaluation of male voice (transformers): The male voice exhibited similar quality to the female voice. However, it had a very low pitch and featured jarring, inconsistent pauses instead of breathing artifacts. It also suffered from the same accuracy issues.

Audio sample (male voice, transformers):

Download

Evaluation of hybrid voices: We also tested the hybrid model's voices. Accuracy issues persisted. The female hybrid voice was slightly clearer than its transformers counterpart (no breathing sounds), but the male hybrid voice had even worse pausing issues.

(Audio samples for hybrid voices)

Download
Download

OpenAI TTS-1 and TTS-1 HD, alloy voice

Score: 6 / 10 (TTS-1), 5 / 10 (TTS-1 HD)
License: Proprietary (Closed source, paid API)
Links: tts-1 docs, tts-1-hd docs
Playground: Official playground

TTS-1 and TTS-1 HD are OpenAI's transformer-based TTS models. TTS-1 HD is positioned as higher quality, while TTS-1 is optimized for lower latency.

Crucially, for our specific use case, we found no discernible quality difference between TTS-1 and TTS-1 HD. Consequently, the higher price of TTS-1 HD ($30/1M chars vs. $15/1M chars for TTS-1) resulted in a lower score. The following evaluation applies to both, using TTS-1 audio unless noted.

Audio sample (alloy voice, TTS-1):

Download

Audio sample (alloy voice, TTS-1 HD):

Download

Accuracy issues (alloy voice):

  • (i) read as capital "I".

  • (ii) and (iii) were skipped entirely.

  • ArXiv mispronounced ("Arziv").

  • LaTeX mispronounced ("Lay-teks").

  • != read as "equals".

  • (iv) read as letters ("I-V").

  • hrs mispronounced ("h-o-r-s").

  • mins read as letters ("m-i-n-s").

Overall evaluation:

OpenAI's Alloy voice is very clear, naturally paced, and free of noticeable artifacts – a pleasant listening experience. It has slightly lower energy and pitch compared to Kokoro's Heart, making it perhaps marginally less engaging for very long sessions, but still very good. Performance at 0.5x and 2.0x was good.

Accuracy was significantly better than the out-of-the-box accuracy of the open-weight models tested, though notable issues remain (skipped Roman numerals, mispronounced terms, symbol errors). This makes it somewhat suitable for technical documents out-of-the-box, but imperfections persist.

As closed-source models accessible only via API, cost is a major factor. At $15/1M chars (TTS-1) and $30/1M chars (TTS-1 HD), they are expensive compared to self-hosted options, though competitive with some other high-end APIs. The lack of customization options to fix remaining errors is also a drawback for specialized content. Evaluation via OpenAI's API playground was simple. We found no misleading claims.

Evaluation of Echo and Nova voices: Echo offers slightly more energy than Alloy. Nova sits between Alloy and Echo in energy, with a higher pitch. Both, along with other standard OpenAI voices, deliver comparable high quality, differing primarily in pitch and energy levels.

Audio sample (echo voice):

Download

Audio sample (echo voice, hd):

Download

Audio sample (nova voice):

Download

Audio sample (nova voice, hd):

Download

OpenAI GPT-4o mini TTS, alloy voice

Score: 5 / 10
License: Proprietary (Closed source, paid API)
Link: gpt-4o-mini-tts docs
Playground: OpenAI.fm playground

This is OpenAI's newer generation TTS model, leveraging the GPT-4o mini backbone, distinct from the earlier TTS-1 family.

Audio sample (alloy voice):

Download

Accuracy issues (alloy voice):

  • (i) was skipped.

  • $ symbol was skipped (read as "five point six million").

  • (iv) was skipped.

  • (v) read as letters ("v-i").

  • ... was skipped.

Overall evaluation:

The Alloy voice remains consistent with the TTS-1 version: high quality, very clear, though still slightly lower energy than ideal for maximum engagement. Performance at 0.5x and 2.0x was good.

The significant change is improved accuracy. The GPT-4o mini backbone demonstrably handles many technical elements better than TTS-1. While some issues persist (particularly skipping certain symbols/numerals), it represents a notable step up in out-of-the-box performance for technical content.

However, this improvement comes at a steep cost. OpenAI shifted to per-minute pricing ($0.015/min input plus $0.015/min output, or $0.030/min combined). Based on our estimate of ~700 characters/minute, this translates to roughly $42/1M characters, significantly more expensive than even TTS-1 HD. This high price heavily impacts the score, despite the quality improvements. Like TTS-1, it lacks user customization for remaining errors.
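The conversion from per-minute to per-character pricing can be reproduced with a few lines of arithmetic. This sketch uses the combined $0.030/min rate (input plus output) and the speaking-rate assumptions stated in our cost notes (150 words/minute, 4.7 characters/word); the function name is ours, and the result is only as good as those assumptions.

```python
def cost_per_million_chars(
    price_per_minute: float,
    words_per_minute: float = 150.0,
    chars_per_word: float = 4.7,
) -> float:
    """Convert a per-minute audio price into dollars per 1M characters."""
    # About 705 characters per minute with the default assumptions.
    chars_per_minute = words_per_minute * chars_per_word
    minutes_per_million_chars = 1_000_000 / chars_per_minute
    return price_per_minute * minutes_per_million_chars


estimate = cost_per_million_chars(0.030)
print(f"~${estimate:.2f} per 1M characters")
```

With the unrounded 705 characters/minute the estimate lands at about $42.55, and with the rounded 700 characters/minute it is about $42.86, so both are consistent with the ~$42 figure used in this review. A faster narrator or denser text would lower the per-character cost, so it is worth re-running with your own speaking-rate numbers.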

OpenAI released a public playground for this model, making it easy to test without having to sign up for the OpenAI API. We did not find any inaccurate or misleading claims on the model pages.

Evaluation of Echo and Nova voices: Echo, Nova, and other voices (including new ones in this model) maintain the high quality seen in TTS-1, primarily varying in gender, pitch, and energy. Echo remains slightly more energetic than Alloy; Nova is energetic with a higher pitch.

Audio sample (echo voice):

Download

Audio sample (nova voice):

Download

Amazon Polly (Generative & Neural), ruth voice

Score: 4.5 / 10 (Generative), 5 / 10 (Neural)
License: Proprietary (Closed source, paid API via AWS)
Links: Generative docs, Neural docs
Playground: AWS Polly console (Requires AWS account)

Amazon Polly offers several TTS tiers. We focused on the Generative (transformer-based, $30/1M chars) and Neural (presumably hybrid, $16/1M chars) models, as the older "Standard" voices are too robotic for modern standards, and the "Long-Form" model is prohibitively expensive.

Audio sample (ruth voice, Generative):

Download

Accuracy issues (ruth voice, identical for both Generative and Neural):

  • (i) read as capital "I".

  • LaTeX mispronounced ("Lateks").

  • != read as "equals".

  • (v) read as letters ("vi").

  • gpus mispronounced ("g-p-o-c").

  • a equals 2: "a" was skipped.

Overall evaluation:

Polly's Ruth voice (Generative version) is very clear, well-balanced in pitch, and possesses a good energy level – slightly less energetic than Kokoro Heart but brighter than OpenAI Alloy. It performed well at 0.5x and 2.0x speeds.

Accuracy is comparable to OpenAI's TTS-1 models – better than the open-weight options but still containing notable errors for technical content. The Generative model's high cost ($30/1M chars) significantly impacts its score, placing it on par with the less performant (in our tests) TTS-1 HD.

The Neural version offers identical accuracy on our test but uses voices that sound slightly more monotonous and less emotive than their Generative counterparts (even when using the same name, like "Ruth"). However, at nearly half the price ($16/1M chars), the Neural model presents a better value proposition for technical narration where high expressiveness isn't paramount, hence its slightly higher score despite the less engaging voice quality.

Both are closed-source models. Evaluation requires a full AWS account setup, creating friction compared to providers with public playgrounds. For existing AWS users, testing is straightforward via the console. We found no misleading claims in the documentation.

Evaluation of Matthew and Danielle voices (Generative & Neural): The Generative versions of Matthew (male) and Danielle (female) are about as clear and energetic as Ruth. Danielle felt slightly less crisp than Ruth but remains a good option. As with Ruth, their Neural counterparts sound flatter but maintain the same accuracy. All Polly voices tested felt fairly similar, primarily differing by gender and pitch.

(Audio samples for matthew and danielle, Generative & Neural)

Audio sample (matthew voice, generative):

Download

Audio sample (danielle voice, generative):

Download

Audio sample (ruth voice, neural):

Download

Audio sample (matthew voice, neural):

Download

Audio sample (danielle voice, neural):

Download

Cartesia Sonic 2.0, sophie voice

Score: 3 / 10
License: Proprietary (Closed source, paid API)
Link: Cartesia Sonic landing page
Playground: Official playground

Sonic 2.0 is Cartesia's flagship TTS model, based on a state-space model architecture.

Audio sample (sophie voice):

Download

Accuracy issues (sophie voice):

  • (i) read as capital "I".

  • ArXiv mispronounced ("Arksiv").

  • LaTeX mispronounced ("Lateks").

  • (ii), (iii) read as letters ("e-e", "e-e-e").

  • != pronounced as "nt equal sign" (the "not" is unclear).

  • (iv) read as letters ("e-v").

  • 2010, 2018 pronounced as individual numbers ("2 0 1 0", "2 0 1 8").

  • gpus mispronounced ("g-pus").

  • hrs read as letters ("h-r-s").

  • mins read literally ("mins").

  • a equals 2: The final "a" was skipped.

Overall evaluation:

Cartesia's Sophie voice is clear, expressive, and has a good energy level. Some pauses felt slightly too long, but this could potentially be mitigated by faster playback speeds. Performance at 0.5x and 2.0x was good.

Unfortunately, Sonic 2.0 exhibited numerous accuracy issues on our technical torture test, performing worse than other major closed-source competitors like OpenAI and Amazon Polly.

Cartesia's pricing is tiered and credit-based, making direct comparison slightly complex, but our estimate puts it around $38/1M characters – among the most expensive options. The combination of high cost and relatively poor technical accuracy significantly lowers its score for our use case.

Evaluation was easy via their free trial credits and web playground. Cartesia's marketing claims (e.g., preference over competitors) might hold true for general-purpose text due to the pleasant voice quality, but our tests show significant room for improvement in handling specialized technical content.

Evaluation of Ethan and Brooke voices: Ethan (male) was the best American male voice we found but felt slightly less clear than Sophie, with subtle breathing artifacts noticeable at high volume. Brooke (female) was also less clear than Sophie but lacked other artifacts. Both are decent secondary voice options but shared the exact same accuracy issues as Sophie.

Audio sample (ethan voice):

Download

Audio sample (brooke voice):

Download

Conclusion and recommendations

Our goal was to identify the best TTS models for the specific, challenging task of narrating research papers, prioritizing pronunciation accuracy (especially for technical content), cost-effectiveness, and pronunciation customization. Unsurprisingly, no single model excelled across all criteria; the optimal choice depends heavily on individual needs and resources.

Inspired by rigorous computer hardware benchmarking, we found that general demos and limited benchmarks often mask weaknesses exposed by our targeted "torture test." This highlights the critical need for evaluating AI models against content representative of their intended application.

Key findings:

  1. Technical pronunciation remains a hurdle: Many models, both open-weight and closed-source, struggled with acronyms, symbols (!=, ..., $), numerical formats (Roman numerals, date ranges, decimals), and specific terms (LaTeX, ArXiv). Achieving accurate out-of-the-box narration for complex technical text is still a significant challenge.

  2. Open-weight vs. closed-source trade-offs:

    • Open-weight models (Kokoro, SparkTTS, Zonos, CSM) offer potential cost savings through self-hosting. However, their out-of-the-box accuracy on our test was generally lower, with some (CSM, Zonos) performing very poorly. Crucially, Kokoro provides deep customization via user-defined pronunciations.

    • Closed-source models (OpenAI, Amazon, Cartesia) generally provided better baseline accuracy and more polished voices. However, they incur higher recurring costs and typically lack mechanisms for users to fix remaining pronunciation errors.

  3. Customization is crucial for accuracy: The ability to correct errors, as demonstrated with Kokoro, transformed a moderate performer into a top contender for our specific needs. This capability is invaluable for domain-specific jargon or evolving terminology.

  4. Cost varies dramatically: Costs ranged from potentially under $1/1M characters (self-hosted small models, optimized) to over $40/1M characters (premium APIs). Cost significantly impacts feasibility for large-scale use. Newer API models like GPT-4o mini TTS offer better accuracy, but at a higher price.

  5. Voice quality matters: While secondary to accuracy, voice clarity, naturalness, and appropriate energy are vital for listenability. We found high-quality voices across models but also encountered issues like excessive emotiveness, robotic tones, or distracting artifacts (breathing, awkward pauses).
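To put the cost spread from finding 4 in concrete terms, the sketch below multiplies the per-1M-character costs from our results table by a daily volume. It uses 10M characters/day, the same "medium-volume" figure our self-hosted estimates assume. The self-hosted numbers are rough estimates, the API numbers ignore any volume discounts or free tiers, and the helper name is ours.

```python
# Approximate cost per 1M characters, taken from the results summary table.
COST_PER_MILLION = {
    "Kokoro 82M 1.0 (self-hosted)": 0.65,
    "SparkTTS 0.5B (self-hosted)": 2.5,
    "Sesame CSM 1B (self-hosted)": 5.0,
    "Zonos 1.6B 0.1 (self-hosted)": 8.0,
    "OpenAI TTS-1": 15.0,
    "Amazon Polly Neural": 16.0,
    "OpenAI TTS-1 HD": 30.0,
    "Amazon Polly Generative": 30.0,
    "Cartesia Sonic 2.0": 38.0,
    "OpenAI GPT-4o mini TTS": 42.0,
}


def daily_cost(chars_per_day: float) -> dict[str, float]:
    """Return the estimated dollars per day for each model at a given volume."""
    return {
        model: price * chars_per_day / 1_000_000
        for model, price in COST_PER_MILLION.items()
    }


costs = daily_cost(10_000_000)
# Print cheapest first so the spread is easy to see.
for model, dollars in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{model:32s} ${dollars:8.2f}/day")
```

At this volume the gap between the cheapest and most expensive options is roughly $6.50/day versus $420/day, which is why cost weighs so heavily in our overall scores.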

Recommendations for research paper narration:

  • For maximum accuracy & customization (cost-conscious): Kokoro 1.0 (modified) emerges as a strong choice. It requires an initial investment in creating a custom pronunciation dictionary, but its low running cost and potential for high accuracy make it ideal if customization effort is feasible.

  • For best out-of-the-box accuracy (cost permitting): OpenAI's GPT-4o mini TTS offered the best out-of-the-box accuracy among the APIs we tested, despite its high price (~$42/1M chars). It notably outperformed its predecessors (TTS-1, TTS-1 HD) on technical content. Amazon Polly Neural ($16/1M chars) offers a reasonable accuracy/cost balance if slightly less emotive voices are acceptable.

  • Proceed with caution: Models like Sesame CSM and Zyphra Zonos exhibited severe quality or accuracy issues in our tests, making them unsuitable for this use case without significant improvements. High-priced APIs like OpenAI GPT-4o mini TTS, Amazon Polly Generative ($30/1M chars), and Cartesia Sonic 2.0 (~$38/1M chars) require careful cost-benefit analysis, weighing their price against their specific performance on your representative content. OpenAI TTS-1/TTS-1 HD were decent but less accurate than the newer GPT-4o mini TTS.

Ultimately, we strongly recommend conducting your own tests using domain-specific content. Our findings underscore that general benchmarks don't capture the nuances of specialized applications. We hope this detailed, use-case-specific review provides a valuable starting point and encourages further transparency and targeted evaluations within the AI community.

About us:

We are the creators of Paper2Audio, a free tool designed to make research papers accessible via audio. Paper2Audio reads your research papers to you using AI, offering summary and full paper modes, summarizing visual elements like tables and figures, and omitting extraneous text like page numbers. Our team consists of:

  • Joe Golden: Entrepreneur & economist, former Collage.com co-founder/CEO (~$100M e-commerce company when sold in 2021, bootstrapped), Google economist, and Microsoft software engineer.

  • Chandradeep Chowdhury: Software engineer, formerly at Amazon AWS.

Get in touch:

  • Let us know if you feel any of our review content is inaccurate or out of date. Our email addresses are our first names @Paper2Audio.com.

  • Model creators: If you've made substantive improvements and would like us to re-evaluate, please reach out. If you'd like your model reviewed using this methodology, contact us.

  • Users: What other models should we review? Are you interested in similar torture tests for other AI modalities (e.g., vision-language models, document content extraction)?


Changelog

  • April 21, 2025: Initial post published.

  • May 21, 2025: Minor edits and formatting improvements.

  • May 22, 2025: Added audio version of post.