TwNC: a Multifaceted Dutch News Corpus
Ordelman, Roeland and Jong de, Franciska and Hessen van, Arjan and Hondorp, Hendri (2007) TwNC: a Multifaceted Dutch News Corpus. ELRA Newsletter, 12 (3-4).
| PDF 179Kb |
| Abstract: | This contribution describes the Twente News Corpus (TwNC), a multifaceted corpus for Dutch that is being deployed in a number of NLP research projects among which tracks within the Dutch national research programme MultimediaN, the NWO programme CATCH, and the Dutch-Flemish programme STEVIN.
The development of the corpus started in 1998 within a predecessor project DRUID and has currently a size of 530M words. The text part has been built from texts of four different sources: Dutch national newspapers, television subtitles, teleprompter (auto-cues) files, and both manually and automatically generated broadcast news transcripts along with the broadcast news audio. TwNC plays a crucial role in the development and evaluation of a wide range of tools and applications for the domain of multimedia indexing, such as large vocabulary speech recognition, cross-media indexing, cross-language information retrieval etc. Part of the corpus was fed into the Dutch written text corpus in the context of the Dutch-Belgian STEVIN project D-COI that was completed in 2007. The sections below will describe the rationale that was the starting point for the corpus development; it will outline the cross-media linking approach adopted within MultimediaN, and finally provide some facts and figures about the corpus. |
| Item Type: | Article |
| Copyright: | © 2007 European Language Resources Association |
| Faculty: | Electrical Engineering, Mathematics and Computer Science (EEMCS) |
| Research Group: | |
| Link to this item: | http://purl.utwente.nl/publications/68090 |
| Export this item as: | BibTeX EndNote HTML Citation Reference Manager |
Repository Staff Only: item control page

Show download statistics for this publication
Show download statistics for this publication