Generative Speech Error Correction for Child ASR

Work with the Speech Processing and Auditory Perception Lab at UCLA

Automatic speech recognition systems trained primarily on adult speech consistently underperform on children, producing characteristic error patterns tied to higher pitch, greater spectral variability, disfluencies, and developing pronunciation. This project introduces CHSER, a dataset of paired ASR hypotheses and reference transcripts curated specifically for studying these error modes, and presents a case study on generative speech error correction (GenSEC) for child ASR. We analyze the recurring failure patterns of state-of-the-art ASR models on children's speech, then evaluate both text-only large language model correction and acoustically conditioned speech-LLM correction. The acoustically conditioned variant consistently outperforms text-only correction by recovering acoustic cues that the original ASR system missed, demonstrating that grounded post-hoc correction is a promising path for low-resource child ASR.
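Correction systems like these are typically scored by word error rate (WER) against the reference transcript. A minimal sketch of how a paired hypothesis/reference example would be scored before and after correction (the sentences are illustrative, not drawn from CHSER):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One paired example: reference, raw ASR hypothesis, and a corrected hypothesis
reference = "the cat sat on the mat"
asr_hypothesis = "the cap sat on mat"            # one substitution, one deletion
corrected = "the cat sat on the mat"             # after post-hoc correction
print(round(wer(reference, asr_hypothesis), 3))  # 2 errors / 6 words -> 0.333
print(wer(reference, corrected))                 # 0.0
```

A GenSEC-style system aims to reduce exactly this gap: the corrector maps the ASR hypothesis toward the reference, and the WER delta before and after correction quantifies the improvement.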

This work was presented at Interspeech 2025 and can be accessed here.