The project aimed to design a
10-million-word corpus that would constitute a plausible sample of
contemporary standard Dutch as spoken in Flanders and the Netherlands.
One third of the data were to be collected in Flanders, two thirds were
to originate from the Netherlands. The entire corpus was to be
transcribed orthographically, lemmatised and enriched with
part-of-speech information. Users should be able to access the speech
recordings through pointers in the transcriptions. For a selection of
one million words, an auditorily verified broad phonetic transcription
was envisaged, and for this part of the corpus the automatic time
alignment would be checked manually at the word level. For most of the
recordings that were not checked by hand, the pointers were expected to
be accurate to within 100 ms.
Also for one million words, a syntactic annotation was envisaged and
250,000 words were to receive a prosodic annotation.
Original overall design (autumn 1998)
The design of the corpus was guided by a number of considerations. First, the corpus was to serve many and diverse interests: different user groups have different requirements when it comes to the quality and quantity of the data, the number and type of speakers, and so on. Second, the total budget for the entire project was fixed at 4.6 million euros, which had to cover all costs of recording and collecting the data, transcribing and annotating them, and so on. Finally, the issue of copyright complicated matters. Since the corpus was to be distributed including the speech files, consent was required from all speakers as well as from any parties that held rights to the recorded material.
The design of the corpus took into
account the various dimensions underlying the variation that can be
observed in language use. In the overall design of the corpus the
principal parameter was taken to be the socio-situational setting in
which language is used. This led us to distinguish a number of
components, each of which could be characterised in terms of its
situational characteristics such as communicative goal, medium, number
of speakers participating, and the relationship between speaker(s) and
hearer(s).
The
specification of each of the components was given in terms of sample
sizes, total number of speakers, range of topics, etc. Where this was
considered to be of particular interest, speaker characteristics such
as gender, age, geographical region, and socio-economic class were used
as (demographic) sampling criteria; otherwise they were merely recorded
as part of the metadata. The overall design of the corpus is given in
Table 1.
Table 1. Original overall design of the corpus (autumn 1998)
| | | | | Component | Words | Flanders | The Netherlands |
|---|---|---|---|---|---|---|---|
| dialogue / multilogue (8,110,000) | private (6,635,000) | unscripted (6,635,000) | direct (3,460,000) | conversations ('face-to-face') | 3,000,000 | 1,000,000 | 2,000,000 |
| | | | | interviews | 460,000 | 230,000 | 230,000 |
| | | | distanced (3,175,000) | telephone conversations | 3,000,000 | 1,000,000 | 2,000,000 |
| | | | | business transactions | 175,000 | 0 | 175,000 |
| | public (1,475,000) | broadcast (750,000) | more or less scripted (750,000) | interviews and discussions | 750,000 | 230,000 | 520,000 |
| | | non-broadcast (725,000) | unscripted (725,000) | discussions, debates, meetings | 375,000 | 130,000 | 245,000 |
| | | | | lectures | 350,000 | 110,000 | 240,000 |
| monologue (1,890,000) | private (40,000) | more or less scripted (40,000) | | descriptions of pictures | 40,000 | 40,000 | 0 |
| | public (1,850,000) | broadcast (950,000) | unscripted (250,000) | spontaneous commentary | 250,000 | 70,000 | 180,000 |
| | | | more or less scripted (700,000) | newsreports, current affairs programmes | 250,000 | 80,000 | 170,000 |
| | | | | news | 250,000 | 80,000 | 170,000 |
| | | | | commentary | 200,000 | 60,000 | 140,000 |
| | | non-broadcast (900,000) | more or less scripted (900,000) | lectures, speeches | 275,000 | 95,000 | 180,000 |
| | | | | read aloud text | 625,000 (+375,000) | 210,000 (+125,000) | 415,000 (+250,000) |
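The figures in Table 1 can be checked against the intended division of one third Flemish and two thirds Netherlandic data. The following is a minimal sketch in Python; the names `PLANNED` and `regional_totals` are illustrative only and are not part of the corpus distribution. The parenthesised additions to the read aloud text component are left out of the calculation.

```python
# Planned component sizes from Table 1, in words: (Flanders, The Netherlands).
# The parenthesised "(+...)" additions for read aloud text are not included.
PLANNED = {
    "conversations ('face-to-face')": (1_000_000, 2_000_000),
    "interviews": (230_000, 230_000),
    "telephone conversations": (1_000_000, 2_000_000),
    "business transactions": (0, 175_000),
    "interviews and discussions": (230_000, 520_000),
    "discussions, debates, meetings": (130_000, 245_000),
    "lectures": (110_000, 240_000),
    "descriptions of pictures": (40_000, 0),
    "spontaneous commentary": (70_000, 180_000),
    "newsreports, current affairs programmes": (80_000, 170_000),
    "news": (80_000, 170_000),
    "commentary": (60_000, 140_000),
    "lectures, speeches": (95_000, 180_000),
    "read aloud text": (210_000, 415_000),
}


def regional_totals(planned):
    """Return the summed word counts as (Flanders, The Netherlands)."""
    flanders = sum(fl for fl, _ in planned.values())
    netherlands = sum(nl for _, nl in planned.values())
    return flanders, netherlands


flanders, netherlands = regional_totals(PLANNED)
total = flanders + netherlands

print(f"Total:           {total:,} words")
print(f"Flanders:        {flanders:,} words ({flanders / total:.2%})")
print(f"The Netherlands: {netherlands:,} words ({netherlands / total:.2%})")
```

On these figures the planned totals come to 3,335,000 words for Flanders and 6,665,000 words for the Netherlands, which matches the one third / two thirds division closely, and together they make up the 10 million words aimed for.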
While the project was ongoing, the design and considerations described above were taken as guidelines. However, as the project progressed, the collection of part of the data fell behind schedule. Therefore, halfway through the project, it was decided to adapt the design somewhat: certain components that had not yet (fully) been realised were reduced or cancelled. Then, towards the end of the project, when the structure of the final release was being considered, it was found that a restructuring of the corpus would be in the interest of the user. The structure of the corpus as distributed in the present version is represented in Table 2.
Table 2. Components distinguished in the Spoken Dutch Corpus (versions 1.0 and 2.0)
Components: | |
---|---|
a. | Spontaneous conversations ('face-to-face') |
b. | Interviews with teachers of Dutch |
c. | Spontaneous telephone dialogues (recorded via a switchboard) |
d. | Spontaneous telephone dialogues (recorded on MD via a local interface) |
e. | Simulated business negotiations |
f. | Interviews/discussions/debates (broadcast) |
g. | (political) Discussions/debates/meetings (non-broadcast) |
h. | Lessons recorded in the classroom |
i. | Live (e.g. sports) commentaries (broadcast) |
j. | Newsreports/reportages (broadcast) |
k. | News (broadcast) |
l. | Commentaries/columns/reviews (broadcast) |
m. | Ceremonious speeches/sermons |
n. | Lectures/seminars |
o. | Read speech |
This is not the place to discuss in detail the sampling procedure that was employed with each component. Here we restrict ourselves to giving a short overview of the different sampling criteria and the (possible) ways in which they have been applied. Please note that not all sampling criteria apply to all components.
Sample size Throughout the corpus, a sample is a stretch of connected
discourse. The sizes of the different samples differ. In a number of
instances, e.g. for the samples making up component o (read speech), a
minimum size was specified so as to meet the requirements specified by
users from a particular field. On the whole, natural break-off points
such as changes of turn, changes of item (in a news broadcast), etc.
have been used to delimit the samples.
Number of speakers per component In principle the number of speakers may vary. For a
number of components, viz. the spontaneous conversations (component a),
the interviews (component b), the telephone dialogues (components c and
d) and the read aloud text (component o), the number of speakers was
specified beforehand.
Speaker characteristics Speaker
characteristics that have played a role as sampling criteria are sex,
age, geographical region, socio-economic class and level of education.
Quality of the recording The quality of the recordings varies. High
quality was aimed for throughout, but recording conditions varied
considerably, so the quality is not equally high in all cases.
For an overview of the data that are
available and their distribution over various components, we refer to
the overview of available data.
Selections for which more advanced annotations were envisaged (autumn 1998)
Once the overall design of the corpus had been established, it remained to be decided which part(s) of the corpus should be included in the selection of one million words (or 250,000 words in the case of prosodic annotation) for which more advanced annotations were envisaged. Preferably, the selection should in some way reflect the composition of the full corpus. While it would have been straightforward simply to select 10 per cent of each component, two powerful arguments were raised against this procedure. First, some user groups required certain minimum amounts of data with specific higher-level (or more advanced) annotations that exceeded the 10 per cent norm. Second, not all types of data could be annotated with the same rate of success and/or at the same expense. Therefore, in the light of the quality standards that were upheld and the time and money available, certain types of data were given priority over others. The selections that were decided upon for each type of advanced annotation are displayed in Table 3.
Table 3 gives an overview of the selections of parts of the corpus for which more advanced annotations were envisaged. The fourteen components that were distinguished here were the same as the ones referred to in the overall design. For each component it was indicated which part would be enriched with which types of annotations. Note that in the table only the size of each component is indicated (in number of words). The specific design of each component and the selection of samples depended on the quality of the speech signal, the distribution over various situational contexts, speakers, topics, etc.
Table 3. Selection of data for which more advanced transcriptions and annotations were envisaged (autumn 1998)
Component | | Total number of words in the corpus | Phonetic transcription + alignment (no. of words) | Syntactic annotation (no. of words) | Prosodic annotation (no. of words) |
---|---|---|---|---|---|
1. | conversations ('face-to-face') | 3,000,000 | 150,000 | 550,000 | 100,000 |
2. | interviews | 460,000 | 50,000 | 50,000 | 20,000 |
3. | telephone conversations | 3,000,000 | 300,000 | 100,000 | 50,000 |
4. | business transactions | 175,000 | 15,000 | 15,000 | 10,000 |
5. | interviews and discussions | 750,000 | 75,000 | 75,000 | 10,000 |
6. | discussions, debates, meetings | 375,000 | 35,000 | 35,000 | 10,000 |
7. | lectures | 350,000 | 35,000 | 35,000 | 0 |
8. | descriptions of pictures | 40,000 | 5,000 | 5,000 | 0 |
9. | spontaneous commentary | 250,000 | 27,500 | 27,500 | 10,000 |
10. | newsreports, current affairs programmes | 250,000 | 25,000 | 25,000 | 10,000 |
11. | news | 250,000 | 27,500 | 27,500 | 10,000 |
12. | commentary | 200,000 | 25,000 | 25,000 | 10,000 |
13. | lectures, speeches | 275,000 | 30,000 | 30,000 | 10,000 |
14. | read aloud text | 625,000 (+ 375,000) | 200,000 | 0 | 0 |
Total | | 10,000,000 | 1,000,000 | 1,000,000 | 250,000 |
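To see how far the selections in Table 3 depart from a flat 10 per cent per component, the share of each component can be computed directly from the table. The sketch below does this in Python. It rests on one assumption: the three annotation columns are read as phonetic transcription plus alignment, syntactic annotation, and prosodic annotation, in that order, which agrees with the totals of 1,000,000, 1,000,000 and 250,000 words mentioned in the text. The names `SELECTION`, `column_totals` and `above_norm` are illustrative only.

```python
from fractions import Fraction

# Rows of Table 3, in words:
# (total in corpus, phonetic transcription + alignment, syntactic, prosodic).
# The column reading is an assumption based on the totals given in the text.
SELECTION = {
    "conversations ('face-to-face')": (3_000_000, 150_000, 550_000, 100_000),
    "interviews": (460_000, 50_000, 50_000, 20_000),
    "telephone conversations": (3_000_000, 300_000, 100_000, 50_000),
    "business transactions": (175_000, 15_000, 15_000, 10_000),
    "interviews and discussions": (750_000, 75_000, 75_000, 10_000),
    "discussions, debates, meetings": (375_000, 35_000, 35_000, 10_000),
    "lectures": (350_000, 35_000, 35_000, 0),
    "descriptions of pictures": (40_000, 5_000, 5_000, 0),
    "spontaneous commentary": (250_000, 27_500, 27_500, 10_000),
    "newsreports, current affairs programmes": (250_000, 25_000, 25_000, 10_000),
    "news": (250_000, 27_500, 27_500, 10_000),
    "commentary": (200_000, 25_000, 25_000, 10_000),
    "lectures, speeches": (275_000, 30_000, 30_000, 10_000),
    "read aloud text": (625_000, 200_000, 0, 0),
}

NORM = Fraction(1, 10)  # the "10 per cent norm" discussed in the text


def column_totals(selection):
    """Sum each of the four columns over all components."""
    return tuple(sum(row[i] for row in selection.values()) for i in range(4))


def above_norm(selection, column):
    """Components whose share in the given column (1, 2 or 3) exceeds 10 per cent."""
    return [
        name
        for name, row in selection.items()
        if Fraction(row[column], row[0]) > NORM  # exact comparison, no rounding
    ]


print(column_totals(SELECTION))
for column, label in [(1, "phonetic + alignment"), (2, "syntactic")]:
    print(label, "above 10 per cent:", above_norm(SELECTION, column))
```

Exact fractions are used so that components at precisely 10 per cent (e.g. telephone conversations in the first annotation column) are not reported as exceeding the norm because of rounding. On the table's figures, read aloud text stands out with 200,000 of 625,000 words (32 per cent) in the first annotation column, whereas face-to-face conversations are over-represented in the syntactic selection (550,000 of 3,000,000 words).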
Actual realisation (version 2.0)
Within the project the targets that were set for the core corpus (selections of data for which additional transcription/annotations were provided) were met. For an overview, we refer to the overview of data with additional transcriptions and annotations.