
Detail of publication


Jáchym Kolář and Jan Švec: The Czech Broadcast Conversation Corpus. Text, Speech and Dialogue (TSD), Pilsen, 2009.




This paper presents the final version of the Czech Broadcast Conversation Corpus, which will shortly be released through the Linguistic Data Consortium (LDC). The corpus contains 72 recordings of a radio discussion program, yielding about 33 hours of transcribed conversational speech from 128 speakers. The release includes not only verbatim transcripts and speaker information but also structural metadata (MDE) annotation, which covers labeling of sentence-like unit boundaries, marking of non-content words such as filled pauses and discourse markers, and annotation of speech disfluencies. The MDE annotation is based on the LDC's annotation standard for English, with changes made to accommodate phenomena specific to Czech. In addition to its importance for research in speech recognition, speaker diarization, and structural metadata extraction, the corpus is also useful for linguistic analysis of conversational Czech.


Title: The Czech Broadcast Conversation Corpus
Author: Jáchym Kolář; Jan Švec
Language: Czech
Date of publication: 13 Sep 2009
Year: 2009
Type of publication: Papers in proceedings of reviewed conferences
Book title: Text, Speech and Dialogue (TSD), Pilsen
Date: 13 Sep 2009 - 17 Sep 2009


 author = {J\'{a}chym Kol\'{a}\v{r} and Jan \v{S}vec},
 title = {The Czech Broadcast Conversation Corpus},
 year = {2009},
 booktitle = {Text, Speech and Dialogue (TSD), Pilsen},
 url = {},