Efficient harvesting of Internet audio for resource-scarce ASR

Davel, MH; Van Heerden, C; Kleynhans, N; Barnard, E

http://hdl.handle.net/10204/5769

Abstract:

Spoken recordings that have been transcribed for human reading (e.g. as captions for audiovisual material, or to provide alternative modes of access to recordings) are widely available in many languages. Such recordings and transcriptions have proven to be a valuable source of ASR data in well-resourced languages, but have not been exploited to a significant extent in under-resourced languages or dialects. Techniques used to harvest such data typically assume the availability of a fairly accurate ASR system, which is generally not available when working with resource-scarce languages. In this work, the authors define a process whereby an ASR corpus is bootstrapped using unmatched ASR models in conjunction with speech and approximate transcriptions sourced from the Internet. They introduce a new segmentation technique based on the use of a phone-internal garbage model, and demonstrate how this technique (combined with limited filtering) can be used to develop a large, high-quality corpus in an under-resourced dialect with minimal effort.

Reference:

Davel, MH, Van Heerden, C, Kleynhans, N and Barnard, E. Efficient harvesting of Internet audio for resource-scarce ASR. 12th Annual Conference of the International Speech Communication Association (Interspeech 2011), Florence, Italy, 27-31 August 2011

Aug 2011

Speech recognition
Under-resourced languages
Garbage modeling
Automatic speech recognition (ASR)
