Improving ASR for Child Speech

Work with the Speech Processing and Auditory Perception Lab at UCLA

Parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) make it cheap to adapt speech foundation models to new domains, but they modify only the global attention mechanism and miss the local acoustic context that often defines a mismatched domain. This work proposes GC-LoRA, an adapter that injects Conformer-style local convolutional processing into a pretrained Transformer encoder by attaching a lightweight module to the encoder’s attention output projections. The adapter captures local acoustic dependencies (reverberation, telephony band-limiting, dialectal variation, and the physiological characteristics of child speech) without disrupting the model’s pretrained global representations. Across acoustically degraded, band-limited, dialectal, and child speech datasets, GC-LoRA achieves Word Error Rate (WER) reductions of up to 10.9% over baselines while adding minimal trainable parameters.
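The idea of pairing a low-rank update with local convolution can be illustrated with a toy example. The sketch below is not the GC-LoRA implementation: the description above does not specify the module's internals, so the placement of a depthwise temporal convolution inside the low-rank bottleneck, the zero-initialised up-projection, and all names (`conv_lora_projection`, `depthwise_conv_time`, `matvec`) are assumptions made for illustration.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def depthwise_conv_time(H, kernels):
    """Convolve each channel of H (T x r) over time with its own
    odd-length kernel, using zero padding so the length stays T."""
    T, r = len(H), len(H[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * r for _ in range(T)]
    for t in range(T):
        for c in range(r):
            for k, w in enumerate(kernels[c]):
                s = t + k - half
                if 0 <= s < T:
                    out[t][c] += w * H[s][c]
    return out

def conv_lora_projection(X, W, A, B, kernels, scale=1.0):
    """X holds T frames of size d. Each output frame is
    W x_t + scale * B h_t, where h is the down-projected sequence
    after a local convolution over time (assumed design)."""
    H = [matvec(A, x) for x in X]          # down-project each frame to rank r
    H = depthwise_conv_time(H, kernels)    # mix neighbouring frames locally
    out = []
    for x, h in zip(X, H):
        base = matvec(W, x)                # frozen pretrained projection
        update = matvec(B, h)              # trainable up-projection
        out.append([b + scale * u for b, u in zip(base, update)])
    return out

random.seed(0)
d, r, T, k = 8, 2, 5, 3                    # hypothetical toy sizes
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]    # frozen
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]          # trainable, zero-initialised
kernels = [[0.0, 1.0, 0.0] for _ in range(r)]  # trainable, identity-initialised
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]

Y = conv_lora_projection(X, W, A, B, kernels)
frozen = [matvec(W, x) for x in X]
print("matches pretrained output at init:", Y == frozen)

trainable = r * d + d * r + r * k
print(trainable, "adapter parameters vs", d * d, "frozen parameters")
```

With the up-projection initialised to zero, the adapted projection starts out identical to the pretrained one, which is one common way to avoid disturbing pretrained representations at the start of training.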

This work has been accepted at Interspeech 2026.

The code can be accessed here

A complementary line of work asks which layers of a speech foundation model should be adapted, rather than how. Gumbel-BEARD is a domain adaptation framework that automates Whisper encoder layer selection through an end-to-end trainable hard Gumbel-Softmax selector, paired with a BEST-RQ self-supervised objective so the model adapts to target acoustic characteristics without manual tuning or transcribed data. On the MyST child speech corpus, fine-tuning with only 10 hours of labeled data matches a fully supervised baseline trained on the complete 133-hour set, and the method sets new state-of-the-art Word Error Rates of 8.21% (Whisper-medium on MyST) and 11.06% (Whisper-small on OGI Spontaneous). Evaluation on CORAAL further shows robustness to adult dialectal shifts, with up to 6% relative WER reduction.
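To make the selection mechanism concrete, here is a minimal standard-library sketch of hard Gumbel-Softmax sampling. It assumes, purely for illustration, that each encoder layer carries two learnable logits (freeze vs. adapt); the actual parameterisation in Gumbel-BEARD may differ, and the function name and toy logits are hypothetical. In an autograd framework the hard one-hot sample would be used in the forward pass while gradients flow through the soft probabilities (the straight-through trick), which this plain-Python version does not implement.

```python
import math
import random

def gumbel_softmax_hard(logits, tau, rng):
    """Return (hard one-hot sample, soft probabilities) for one
    categorical choice, using Gumbel noise and temperature tau."""
    gumbels = []
    for _ in logits:
        u = max(rng.random(), 1e-12)             # guard against log(0)
        gumbels.append(-math.log(-math.log(u)))  # Gumbel(0, 1) sample
    y = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(y)                                   # subtract max for stability
    exps = [math.exp(v - m) for v in y]
    z = sum(exps)
    soft = [e / z for e in exps]
    idx = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == idx else 0.0 for i in range(len(soft))]
    return hard, soft

rng = random.Random(0)
# Hypothetical per-layer logits in the order [freeze, adapt].
layer_logits = [[2.0, -1.0], [0.5, 0.5], [-1.0, 2.0], [-2.0, 3.0]]
adapt_mask = [gumbel_softmax_hard(l, tau=1.0, rng=rng)[0][1] for l in layer_logits]
print("adapt mask per layer:", adapt_mask)
```

Layers whose "adapt" logit dominates are selected in most samples, while layers with balanced logits are selected about half the time, so the selector can explore during training before its logits sharpen.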

This work has been accepted at Interspeech 2026.

Moving from single-model adaptation to unified multi-domain recognition, a related effort proposes a Mixture-of-Experts (MoE) Speech-LLM that handles adult and child speech across diverse environments and age groups within a single system. A classifier-based domain router follows a coarse-to-fine strategy and combines a Mixture-of-Projectors and a Mixture-of-LoRAs to model domain-specific variation, while an entropy-aware routing mechanism dynamically falls back to a shared expert when the router is uncertain near domain boundaries. Experiments on public child corpora show consistent improvements over baselines while preserving adult ASR performance; to our knowledge, this is the first work to use Speech-LLMs for unified, multi-domain ASR spanning both children and adults.
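The entropy-based fallback can be sketched in a few lines. This is an illustrative simplification under stated assumptions: the expert names, the single-level router (the coarse-to-fine hierarchy is not modelled), the normalised-entropy criterion, and the threshold value of 0.6 are all hypothetical rather than taken from the paper.

```python
import math

EXPERTS = ["adult", "child_younger", "child_older"]  # hypothetical domains

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by log(n): 0 is certain, 1 is uniform."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def route(router_logits, threshold=0.6):
    """Pick the most likely domain expert, or fall back to the
    shared expert when the router's distribution is too flat."""
    probs = softmax(router_logits)
    uncertainty = normalized_entropy(probs)
    if uncertainty > threshold:
        return "shared", uncertainty
    best = max(range(len(probs)), key=probs.__getitem__)
    return EXPERTS[best], uncertainty

print(route([4.0, 0.0, 0.0]))   # confident router
print(route([1.0, 1.0, 1.0]))   # maximally uncertain router
```

Normalising the entropy by log(n) keeps the threshold meaningful if the number of domain experts changes.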

This work has been accepted at Interspeech 2026.

We also study discrete speech tokenization for child ASR, where representing speech as compact discrete tokens enables storage efficiency and integration with large language models. Discrete tokens are typically split into acoustic and semantic varieties, with semantic tokens being more useful for recognition. This work systematically compares the traditional unsupervised approach (K-means clustering over speech foundation model features) against supervised tokenization (finite scalar quantization trained with an ASR loss). Supervised semantic tokens not only outperform unsupervised ones but unexpectedly surpass even continuous representations, and they remain effective in ultra-low bitrate settings. These findings offer practical guidance for discrete speech tokenization in low-resource tasks like child ASR.
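The two tokenization routes and the bitrate arithmetic can be sketched as follows. This is a simplified illustration, not the paper's setup: the finite scalar quantizer handles odd level counts only, the ASR-loss training of the encoder that feeds it is not shown, and the level configuration, toy centroids, and 50 tokens-per-second rate are hypothetical values chosen for the example.

```python
import math

def fsq_quantize(z, levels):
    """Finite scalar quantization of one latent frame: bound each
    dimension with tanh, round it to one of `levels[i]` values, and
    combine the per-dimension digits into a single token index."""
    digits = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (n_levels - 1) // 2
        q = round(math.tanh(value) * half)   # integer in [-half, half]
        digits.append(q + half)              # shift to [0, n_levels - 1]
    index, base = 0, 1
    for digit, n_levels in zip(digits, levels):
        index += digit * base                # mixed-radix encoding
        base *= n_levels
    return digits, index

def kmeans_token(frame, centroids):
    """Unsupervised alternative: index of the nearest centroid."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, centroids[k])),
    )

def bitrate_bps(codebook_size, tokens_per_second):
    """Bits per second needed to transmit one token stream."""
    return math.log2(codebook_size) * tokens_per_second

levels = [5, 5, 5]                           # hypothetical: 125 codes
codebook_size = math.prod(levels)
print(fsq_quantize([0.3, -2.0, 1.1], levels))
print(kmeans_token([0.9, 1.1], [[0.0, 0.0], [1.0, 1.0]]))
print(round(bitrate_bps(codebook_size, 50), 1), "bits per second")
```

The bitrate depends only on the codebook size and the token rate, so shrinking either one is what moves a tokenizer toward the ultra-low bitrate regime.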

This work was presented at the AI4CSL workshop at ASRU 2025, and can be accessed here

Speech foundation models finetuned on one domain, such as LibriSpeech (adult read speech), perform poorly on other domains (child or noisy speech). One solution is to collect as much labelled data as possible and finetune jointly on all domains. However, collecting paired speech-text data from the target domain is costly, and retraining the model is computationally expensive. In this paper, we introduce a simple yet effective method, speech-only adaptation (SOA), based on speech foundation models (Wav2vec 2.0), which requires only speech input data from the target domain. Specifically, the Wav2vec 2.0 feature encoder is continually pretrained with the Wav2vec 2.0 loss on both source and target domain data, while the contextual encoder is frozen. Compared to a source-domain finetuned model whose feature encoder was frozen during training, we find that simply replacing the frozen feature encoder with the adapted one provides significant WER improvements on the target domain while preserving performance on the source domain. The effectiveness of SOA is examined in various low-resource or domain-mismatched ASR settings, including adult-child and clean-noisy speech.
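The final step of SOA, swapping the adapted feature encoder into the source-finetuned model, amounts to a selective checkpoint merge. The sketch below uses plain dictionaries as stand-ins for model checkpoints; the parameter names, the `feature_encoder.` prefix, and the function `swap_feature_encoder` are hypothetical and do not correspond to any specific toolkit's checkpoint layout.

```python
def swap_feature_encoder(finetuned, adapted, prefix="feature_encoder."):
    """Return a copy of the finetuned checkpoint whose feature-encoder
    parameters are taken from the adapted checkpoint. All other
    parameters (contextual encoder, output head) stay untouched."""
    merged = dict(finetuned)
    for name, value in adapted.items():
        if name.startswith(prefix):
            if name not in finetuned:
                raise KeyError(f"unexpected parameter: {name}")
            merged[name] = value
    return merged

# Toy checkpoints: short lists stand in for weight tensors.
source_finetuned = {
    "feature_encoder.conv0.weight": [0.1, 0.2],   # frozen during finetuning
    "context_encoder.layer0.weight": [0.5, 0.6],  # finetuned on source data
    "ctc_head.weight": [0.9],
}
speech_only_adapted = {
    "feature_encoder.conv0.weight": [0.3, 0.4],   # continually pretrained
    "context_encoder.layer0.weight": [0.0, 0.0],  # ignored by the swap
}

merged = swap_feature_encoder(source_finetuned, speech_only_adapted)
for name, value in merged.items():
    print(name, value)
```

Because only the prefixed parameters are replaced, the merged model keeps everything that was learned from labelled source data, and no target-domain transcripts are needed at any point.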

This work was presented at the Self-supervision in Audio, Speech and Beyond workshop at ICASSP 2024, and can be accessed here

Speech foundation models (SFMs) have achieved state-of-the-art results on various speech tasks in supervised (e.g., Whisper) or self-supervised (e.g., WavLM) systems. However, the performance of SFMs for child ASR has not been systematically studied. In addition, there is no benchmark for child ASR with standard evaluations, which makes it difficult to compare novel ideas. In this paper, we initiate and present a comprehensive benchmark on several child speech databases based on various SFMs (Whisper, Wav2vec 2.0, HuBERT, and WavLM). Moreover, we investigate finetuning strategies by comparing various data augmentation and parameter-efficient finetuning (PEFT) methods. We observe that these methods behave differently as model size increases. For example, PEFT matches the performance of full finetuning for large models but is worse for small models. To stabilize finetuning using augmented data, we propose a perturbation invariant finetuning (PIF) loss as a regularization.
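Since the results on this page are reported as Word Error Rate, a short reference computation may help when comparing systems. The sketch below implements the standard word-level edit distance and a relative-reduction helper; it applies no text normalization (casing, punctuation, or number handling), which real evaluations typically do, and the example sentences and WER values are made up.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))       # distances for an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (r != h),    # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

print(wer("the cat sat on the mat", "the cat sit on mat"))
print(relative_reduction(0.20, 0.18))
```

Distinguishing absolute from relative reductions matters when comparing papers: going from 20% to 18% WER is a 2-point absolute reduction but a 10% relative one.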

Our code can be accessed here

This work was presented at Interspeech 2024, and can be accessed here

A follow-up study examines how self-supervised speech representations shift across speaker age and how that shift drives ASR degradation on child speech. We introduce Delta SSL embeddings, frame-level differences between SSL representations of an utterance and a reference, and show that incorporating these delta features into ASR training improves robustness on children’s speech without retraining the underlying foundation model. This makes the approach a practical recipe when target-domain labeled data is scarce.
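As a rough illustration of what a frame-level delta feature looks like, the sketch below subtracts a reference sequence from an utterance's SSL frames and appends the result to the original features. How the reference is obtained and how the deltas enter ASR training are not described above, so the equal-length reference, the concatenation step, and the function names are assumptions made only for this example.

```python
def delta_features(utt_frames, ref_frames):
    """Frame-by-frame difference between an utterance's SSL
    embeddings and a reference of the same shape (assumed)."""
    if len(utt_frames) != len(ref_frames):
        raise ValueError("utterance and reference need the same frame count")
    return [
        [u - r for u, r in zip(utt, ref)]
        for utt, ref in zip(utt_frames, ref_frames)
    ]

def append_deltas(utt_frames, deltas):
    """Concatenate each frame with its delta (one possible way to
    feed deltas to a downstream ASR model; an assumption here)."""
    return [frame + delta for frame, delta in zip(utt_frames, deltas)]

# Toy 2-frame, 2-dimensional embeddings.
utterance = [[1.0, 2.0], [3.0, 4.0]]
reference = [[0.5, 0.5], [1.0, 1.0]]

deltas = delta_features(utterance, reference)
augmented = append_deltas(utterance, deltas)
print(deltas)
print(augmented)
```

Because the deltas are computed from fixed embeddings, the foundation model itself needs no gradient updates, which matches the no-retraining property described above.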

This work was presented at ICASSP 2026, and can be accessed here

A collaborative effort with the ESPnet community provides a systematic benchmark of training paradigms, dataset compositions, and model scaling for child ASR within the ESPnet toolkit. The study covers supervised vs. self-supervised pretraining, multi-corpus mixing strategies, and parameter scaling across model sizes, providing reproducible recipes and baselines that the broader community can build on for child ASR research.

This work was presented at WOCCI 2025, and can be accessed here