The Spoken Dutch Corpus project
 

The Spoken Dutch Corpus project aimed to construct a database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders. The intended size of the corpus was ten million words (about 1,000 hours of speech), two thirds of which would originate from the Netherlands and one third from Flanders. This release presents the results of the project. The total number of words available here is nearly 9 million: some 3.3 million words were collected in Flanders and well over 5.6 million in the Netherlands.

The corpus comprises a large number of samples of (recorded) spoken text. The entire corpus has been transcribed orthographically, and the transcripts have been linked to the speech files. The orthographic transcription was used as the starting point for the lemmatization and part-of-speech tagging of the corpus. For a selection of one million words, a (verified) broad phonetic transcription has been produced; for this part of the corpus, the alignment of the transcripts and the speech files has also been verified at the word level. In addition, a selection of one million words has been annotated syntactically. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available.

Parts of the corpus have already been made available in the course of the project through a number of intermediate releases. The present release replaces all earlier releases.

The project was funded by the Flemish and Dutch governments and the Netherlands Organization for Scientific Research (NWO). The total budget was some 4.9 million euros. The Dutch Language Union (Nederlandse Taalunie) holds all rights. It is therefore not permitted to reproduce or make public part(s) of the data in any fashion without prior written permission from the Dutch Language Union.

The corpus may be used for scientific research and the development of non-commercial products. With a commercial license the corpus may also be used for the development of commercial derivative products such as speech recognizers and language models. In neither kind of product may the contributions made by individual speakers be present in such a fashion that these can be identified.



Background and motivation

Standard Dutch is the official language in the Netherlands (some 15 million people speak northern standard Dutch), in Flanders (the northern part of Belgium, about 5.6 million people speak southern standard Dutch), in Surinam (some 360,000 speakers, about 50 per cent of whom live in the Netherlands) and the Dutch Antilles (some 240,000 speakers). Although northern and southern standard Dutch are variants of the same language, there are considerable differences between them with regard to syntax, morphology, lexicon and phonetics/phonology.

As one of the smaller languages in Europe, Dutch is under serious threat of gradually disappearing as it is losing ground to English. The availability of the necessary resources has placed the English language and speech technology in the leading position it holds today and has further strengthened the position of English for business communication. The fact that for Dutch few relevant resources were available has formed a serious complication for the advancement of Dutch language and speech technology. The Spoken Dutch Corpus project sought to ameliorate this situation.

Apart from the interests held by language and speech technologists, the corpus is intended to serve several other research interests. For one, the corpus is of importance to linguists from various backgrounds. Until now, only written text corpora were available for Dutch. As a consequence, Dutch descriptive linguistics has focused on the written language, while there is as yet hardly any systematic knowledge of the much more elusive spoken form of the language. A corpus of spoken Dutch is also relevant for teaching. A thorough knowledge of everyday language use is essential to the development of course materials for the teaching of Dutch as a second language as well as for teaching Dutch in primary and secondary schools.



Project organisation

The Spoken Dutch Corpus project was directed by a board whose members included representatives of the two governments, the Dutch Language Union, the Dutch and Flemish research foundations, and the Nijmegen Max Planck Institute (MPI). At the start of the project, Prof. W.J.M. Levelt was Chairman of the board. When he stepped down, he was succeeded by Prof. S. Nooteboom of the Landelijke Onderzoekschool Taalkunde (LOT), while Prof. W. Vonk of the MPI (also KUN) became a board member.

The board appointed a steering committee, consisting of experts from various linguistic (sub)disciplines and expert language and speech technologists, which was responsible for the project's progress and finances.

The project was coordinated at two national coordinating sites: Ghent for Flanders and Nijmegen for the Netherlands. Each site was directed by a project manager. The project managers, in collaboration with three specialist working groups (one for corpus design, one for signal processing, and one for corpus annotation), were responsible for the design and implementation of the various work packages. The working group for corpus design was responsible for the design and composition of the corpus, speaker recruitment and the acquisition of recordings. The working group for signal processing was in charge of the development of the protocol for orthographic transcription, and the development and implementation of the procedures that were used. It was similarly involved in word segmentation, phonetic transcription and prosodic annotation. Finally, the working group for corpus annotation was responsible for POS tagging, lemmatisation, lexicon link-up and syntactic annotation, as well as the development of the Spoken Dutch Corpus lexicon.

Apart from the aforementioned working groups, a number of expert committees were appointed. These had an advisory role with regard to the following matters: the development of corpus exploitation software, evaluation, and prosodic annotation.

The project's secretariat provided general organizational and administrative services.

A user group was set up whose principal role was to monitor and critically assess the design and implementation of procedures and protocols and to evaluate (intermediate) results.  



Work packages

The project aimed to compile a ten-million-word corpus that would constitute a plausible sample of contemporary standard Dutch as spoken in Flanders and the Netherlands. One third of the data was to be collected in Flanders, while two thirds were to originate from the Netherlands. The entire corpus was to be orthographically transcribed and tagged for part-of-speech information. In addition, for a selection of one million words more advanced transcriptions and annotations were envisaged.

The following work packages were distinguished: 

Corpus design and compilation

More information on the design of the corpus and its motivation can be found here.

Recording and digitization

Recordings were made by people working for the project or - as for example in the case of spontaneously spoken dialogues - by volunteers who had kindly agreed to record conversations conducted in their homes. Recordings were also obtained through collaboration with other projects, companies, organizations, and institutions. All recordings were digitized. With the exception of the telephone conversations, all material was stored in an uncompressed 16 bit, 16 kHz wav format (for more information on the wav format, see here). Information about the recording conditions, the equipment used, etc. is available as part of the metadata.

The audio files in wav format can be played by means of the PRAAT or the COREX software. Alternatively, most other audio players, both on PC and on other platforms, can handle the wav files. Both PRAAT and COREX make it possible for users to play the recordings and at the same time view the orthographic transcript.
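
As a minimal sketch of how the stated format could be checked, the following Python snippet reads the header of a wav file with the standard wave module and reports whether it is 16 bit, 16 kHz. The function names and the in-memory test file are illustrative assumptions and not part of the corpus software; the mono channel count used for the test file is likewise an assumption, as the text above does not state the number of channels.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, bits per sample, sample rate in Hz, duration in s).

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, bits, rate, duration


def matches_corpus_format(source):
    """True if the file is 16 bit, 16 kHz, as described for the corpus."""
    _, bits, rate, _ = describe_wav(source)
    return bits == 16 and rate == 16000


def make_silent_wav(seconds=1, rate=16000):
    """Build a silent 16-bit mono wav file in memory (for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(2)   # 2 bytes = 16 bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(describe_wav(make_silent_wav(seconds=2)))
    print(matches_corpus_format(make_silent_wav(rate=8000)))
```

A file path can be passed instead of the in-memory buffer. Files that deviate from the stated format, such as the telephone conversations, would be reported as not matching.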

Orthographic transcription

A verbatim transcript was made of every recording. The transcripts conform to a large extent to standard spelling conventions. A protocol (Goedertier & Goddijn, 2000; here available in .ps and .pdf format; Dutch only) has been developed which describes in detail what to transcribe and how to deal with new words, dialect, mispronunciations, and so on. Background noises are not represented in the transcript.

To facilitate the transcription process, use was made of the interactive signal processing tool PRAAT that was developed by Paul Boersma at the University of Amsterdam. In PRAAT it is possible to listen to and visualize the speech signal and at the same time create and view the orthographic transcript. Each speaker was assigned his or her own tier.  

During the transcription process, transcribers segmented the audio files into relatively short chunks (of approx. 2 to 3 seconds each) by inserting time markers in unfilled pauses between words. At a later stage these markers were used as anchor points for the automatic alignment of the transcript and the speech file.
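
The relation between time markers and chunks can be sketched as follows. This is a hypothetical illustration in Python, not the procedure or data format used in the project: markers are assumed to be given as times in seconds, and the function names and the 3-second threshold are chosen here for the example only.

```python
def chunks_from_markers(markers, total_duration):
    """Turn marker times (in seconds) into consecutive (start, end) chunks.

    Markers outside the open interval (0, total_duration) are ignored.
    """
    inside = sorted(m for m in markers if 0.0 < m < total_duration)
    boundaries = [0.0] + inside + [total_duration]
    return list(zip(boundaries[:-1], boundaries[1:]))


def long_chunks(chunks, max_seconds=3.0):
    """Return the chunks that are longer than max_seconds."""
    return [(start, end) for start, end in chunks if end - start > max_seconds]


if __name__ == "__main__":
    chunks = chunks_from_markers([2.5, 5.0, 9.0], total_duration=10.0)
    print(chunks)
    print(long_chunks(chunks))
```

Each chunk boundary can then serve as an anchor point: an aligner only has to match the words of one chunk against a few seconds of audio rather than against the whole recording.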

For more information about the orthographic transcription, see orthography/info.htm.

Lemmatisation and POS tagging

The entire corpus has been tagged for part-of-speech information. Within the project a tag set was defined which consists of 316 tags. The tag set closely follows the Algemene Nederlandse Spraakkunst (ANS, the authoritative Dutch reference grammar; Haeseryn et al., 1997). The tag set conforms to the EAGLES guidelines. A description can be found in Van Eynde (2003; here available in .pdf format; Dutch only).

Tagging was done by means of the POS tagger that was developed at the University of Tilburg. The tagger assigned the most likely tag to each word in the text; its output was then checked manually and corrected where necessary. In addition to the POS tag, the associated lemma is given for each word. A lemmatiser automatically associated the appropriate lemma with each token, and the result was manually checked and corrected.

For more information about the POS tagging, see pos tagging/info.htm.
For more information about the lemmatisation, see lemmatisation/info.htm.

Lexicon link-up

Within the project a lexicon has been developed. The lexicon has played an important role in the transcription and annotation of the corpus, while it also constitutes a possible way of accessing the data. The lexicon link-up made it possible to realise a more advanced form of lemmatisation in which the constituent parts of split verbs and foreign multi-word units were considered jointly and a lemma was associated with the combination as a whole. The protocol that was used (Piepenbrock 2004) is available here in .ps and .pdf format (Dutch only).
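
The idea of associating one lemma with a combination of tokens can be sketched in Python as follows. The data structures, the function name and the example sentence with its lemmas are hypothetical and chosen for illustration; they do not reproduce the corpus file format or the protocol. The example uses the Dutch separable verb opbellen, whose parts belt and op are split in the sentence.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    lemma: str  # lemma assigned to the token on its own


def joint_lemmas(tokens, links):
    """Return one lemma per token, with linked tokens sharing a joint lemma.

    `links` is a list of (token_indices, joint_lemma) pairs. The input
    tokens are left unchanged.
    """
    lemmas = [token.lemma for token in tokens]
    for indices, joint_lemma in links:
        for index in indices:
            lemmas[index] = joint_lemma
    return lemmas


if __name__ == "__main__":
    # "hij belt me morgen op" (he will call me tomorrow)
    sentence = [
        Token("hij", "hij"),
        Token("belt", "bellen"),
        Token("me", "me"),
        Token("morgen", "morgen"),
        Token("op", "op"),
    ]
    # Hypothetical link: tokens 1 and 4 together form the verb "opbellen".
    print(joint_lemmas(sentence, [((1, 4), "opbellen")]))
```

Keeping the links separate from the token-level lemmas means that both views remain available: the lemma of each token on its own and the lemma of the combination as a whole.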

For more information, see lex linkup/info.htm.

Broad phonetic transcription

For about one million words a (verified) broad phonetic transcription is available. The protocol that has been used (Gillis 2001) is available here in .ps and .pdf format (Dutch only). The transcriptions were made using the PRAAT program.

For more information about the broad phonetic transcription, see phonetics/info.htm.

Word segmentation

For all data for which a verified phonetic transcription is available, the speech file has been aligned with the orthographic transcript at the word level. The output has been checked manually and corrected. The protocol (Binnenpoorte 2002, 2004) is available here in .ps and .pdf format (Dutch only). For the remainder of the material the speech files and transcripts have been aligned automatically, without manual verification.

For more information about the word segmentation, see word align/info.htm.

Syntactic annotation

For the syntactic annotation of a selection of one million words an annotation scheme has been developed. Before annotation of the corpus was begun, the scheme was applied to several test samples. This led to some adaptations after which the protocol was finalised. Syntactic annotation was carried out semi-automatically, using the Annotate software. The protocol that has been used (Hoekstra et al. 2003) is available here in .pdf format (Dutch only).

For more information about the syntactic annotation, see syntax/info.htm.

Prosodic annotation

For approximately 250,000 words a prosodic annotation is available. The annotation encompasses the identification of the most important phrase boundaries as well as the one or two most important words (sentence accents) of each phrase. The protocol that has been used (Martens 2003) is available here in .ps and .pdf format (Dutch only).

For more information about the prosodic annotation, see prosody/info.htm.

Development of exploitation software

Within the Spoken Dutch Corpus project exploitation software was developed by the technical group at the MPI. The software provides easy and efficient access to the data.

For more information (incl. documentation) about the exploitation software, see corex/info.htm.

 


Time table  

The Spoken Dutch Corpus project ran for well over five years. The official starting date was 1 June 1998. During the first year much time was invested in corpus design, the development of various protocols (especially for making the recordings, signal processing, the archiving of the data, the orthographic transcription and the broad phonetic transcription) and the selection and adaptation of tools and resources (such as the lexicon). The corpus was compiled incrementally. The project ended on 1 March 2004. In January 2006 Version 2.0 of the corpus (Corpus Gesproken Nederlands, CGN) was released. An overview of the differences between Version 1.0 and Version 2.0 can be found here.



Dissemination of the results
 

Parts of the corpus were already made available in the course of the project through a number of intermediate releases. The present release (version 1.0) of the complete corpus replaces all previous releases. Version 1.0 includes the results of the Spoken Dutch Corpus project as they were available when the project ended on 1 March 2004. In all, it comprised 33 DVDs, 32 of which contain the sound files that are part of the corpus.

In January 2006 Version 2.0 was released. An overview of the differences between Version 1.0 and Version 2.0 can be found here.

The distribution of the corpus, including the recordings, is handled by the TST centre (INL).



Publications
 

Apart from the various protocols and working documents that were produced during the project, a number of papers have been published in which various aspects of the design and annotation of the corpus are discussed in some detail. For an overview we refer to the list of publications.