arXiv:2508.12255v1 Announce Type: new
Abstract: Speech foundation models (SFMs) are designed to serve as general-purpose representations for a wide range of speech-processing tasks. The last five years have seen an influx of increasingly successful self-supervised and supervised pre-trained models with impressive performance on various downstream tasks.
Although the zoo of SFMs continues to grow, our understanding of the knowledge they acquire lags behind. This thesis presents a lightweight analysis framework using statistical tools and training-free tasks to investigate the acoustic and linguistic knowledge encoded in SFM layers. We conduct a comparative study across multiple SFMs and statistical tools. Our study also shows that the analytical insights have concrete implications for downstream task performance.
The effectiveness of an SFM is ultimately determined by its performance on speech applications. Yet it remains unclear whether the benefits extend to spoken language understanding (SLU) tasks that require a deeper understanding than widely studied ones, such as speech recognition. The limited exploration of SLU is primarily due to a lack of relevant datasets. To alleviate that, this thesis contributes tasks, specifically spoken named entity recognition (NER) and named entity localization (NEL), to the Spoken Language Understanding Evaluation benchmark. We develop SFM-based approaches for NER and NEL, and find that end-to-end (E2E) models leveraging SFMs can surpass traditional cascaded (speech recognition followed by a text model) approaches. Further, we evaluate E2E SLU models across SFMs and adaptation strategies to assess the impact on task performance.
Collectively, this thesis tackles previously unanswered questions about SFMs, providing tools and datasets to further our understanding and to enable the community to make informed design choices for future model development and adoption.