This release makes available the results of the Spoken Dutch Corpus project: the recordings with all accompanying transcriptions and annotations, the documentation, the lexicon and the COREX exploitation software.
Below is an overview of the data available in this release. For an overview of the samples available per component, see the section on the sound recordings.
Overview of available data with basic transcriptions and annotations
Table 1 presents an overview of the data for which, in addition to a sound recording, the following are available: an orthographic transcription, POS tagging, lemmatisation, an automatic broad phonetic transcription and an automatic word segmentation.
Table 1. Overview of data in Version 2 ('VL' refers to the Flemish data, 'NL' to the Dutch data)
Component | Total number of words | VL | NL |
---|---|---|---|
a. Spontaneous conversations ('face-to-face') | 2,626,172 | 878,383 | 1,747,789 |
b. Interviews with teachers of Dutch | 565,433 | 315,554 | 249,879 |
c. Spontaneous telephone dialogues (recorded via a switchboard) | 1,232,636 | 489,100 | 743,537 |
d. Spontaneous telephone dialogues (recorded on MD with local interface) | 853,371 | 343,167 | 510,204 |
e. Simulated business negotiations | 136,461 | 0 | 136,461 |
f. Interviews/discussions/debates (broadcast) | 790,269 | 250,708 | 539,561 |
g. (political) Discussions/debates/meetings (non-broadcast) | 360,328 | 138,819 | 221,509 |
h. Lessons recorded in the classroom | 405,409 | 105,436 | 299,973 |
i. Live (e.g. sports) commentaries (broadcast) | 208,399 | 78,022 | 130,377 |
j. Newsreports/reportages (broadcast) | 186,072 | 95,206 | 90,866 |
k. News (broadcast) | 368,153 | 82,855 | 285,298 |
l. Commentaries/columns/reviews (broadcast) | 145,553 | 65,386 | 80,167 |
m. Ceremonious speeches/sermons | 18,075 | 12,510 | 5,565 |
n. Lectures/seminars | 140,901 | 79,067 | 61,834 |
o. Read speech | 903,043 | 351,419 | 551,624 |
Total | 8,940,098 | 3,285,631 | 5,654,644 |
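As an illustration of how the figures in Table 1 relate to each other, the following minimal Python sketch computes the Flemish and Dutch shares for a few components. It assumes that the three figures in each row are, in order, the total, the Flemish (VL) count and the Dutch (NL) count; the names `TABLE_1`, `regional_shares` and `is_consistent` are hypothetical and are not part of the corpus distribution or of COREX.

```python
# Word counts copied from three rows of Table 1 as (total, VL, NL).
# The column order is an assumption; all names here are illustrative only.
TABLE_1 = {
    "a. Spontaneous conversations": (2_626_172, 878_383, 1_747_789),
    "b. Interviews with teachers of Dutch": (565_433, 315_554, 249_879),
    "o. Read speech": (903_043, 351_419, 551_624),
}


def regional_shares(total, vl, nl):
    """Return the VL and NL word counts as fractions of the component total."""
    return vl / total, nl / total


def is_consistent(total, vl, nl):
    """Check that the VL and NL counts add up to the component total."""
    return vl + nl == total


for name, (total, vl, nl) in TABLE_1.items():
    vl_share, nl_share = regional_shares(total, vl, nl)
    ok = is_consistent(total, vl, nl)
    print(f"{name}: VL {vl_share:.1%}, NL {nl_share:.1%}, consistent={ok}")
```

The same check can be extended to the remaining rows by adding them to the dictionary in the same form.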
Overview of data with additional transcriptions and annotations
Tables 2a and 2b present an overview of the additional transcriptions and annotations that are available for a selection of samples. For data with a manually verified broad phonetic transcription, the word segmentation has also been manually verified. Table 2a covers the data from the Netherlands, Table 2b the Flemish data. For more information, see the metadata (sample information).
Table 2a. Additional transcriptions and annotations (data from the Netherlands)
Component | Quantity of data (number of words) with a | | |
---|---|---|---|
a. Spontaneous conversations ('face-to-face') | 106,182 | 300,368 | 37,406 |
b. Interviews with teachers of Dutch | 25,687 | 25,687 | 7,596 |
c. Spontaneous telephone dialogues (recorded via a switchboard) | 201,141 | 69,933 | 20,070 |
d. Spontaneous telephone dialogues (recorded on MD with a local interface) | 0 | 0 | 0 |
e. Simulated business negotiations | 25,485 | 25,485 | 7,485 |
f. Interviews/discussions/debates (broadcast) | 75,106 | 75,106 | 7,537 |
g. (political) Discussions/debates/meetings (non-broadcast) | 25,117 | 25,117 | 7,678 |
h. Lessons recorded in the classroom | 25,961 | 25,961 | 0 |
i. Live (e.g. sports) commentaries (broadcast) | 24,986 | 24,986 | 5,866 |
j. Newsreports/reportages (broadcast) | 25,065 | 25,065 | 5,617 |
k. News (broadcast) | 25,296 | 25,384 | 7,437 |
l. Commentaries/columns/reviews (broadcast) | 25,071 | 25,071 | 7,541 |
m. Ceremonious speeches/sermons | 5,184 | 5,184 | 978 |
n. Lectures/seminars | 14,913 | 14,913 | 6,577 |
o. Read speech | 70,223 | 0 | 0 |
Total | 675,417 | 668,260 | 121,788 |
Table 2b. Additional transcriptions and annotations (data from Flanders)
Component | Quantity of data (number of words) with a | | |
---|---|---|---|
a. Spontaneous conversations ('face-to-face') | 70,945 | 146,745 | 49,988 |
b. Interviews with teachers of Dutch | 34,064 | 34,064 | 7,667 |
c. Spontaneous telephone dialogues (recorded via a switchboard) | 68,886 | 19,886 | 19,874 |
d. Spontaneous telephone dialogues (recorded on MD with a local interface) | 6,257 | 6,257 | 0 |
e. Simulated business negotiations | 0 | 0 | 0 |
f. Interviews/discussions/debates (broadcast) | 25,144 | 25,144 | 10,007 |
g. (political) Discussions/debates/meetings (non-broadcast) | 9,009 | 9,009 | 5,414 |
h. Lessons recorded in the classroom | 10,103 | 10,103 | 0 |
i. Live (e.g. sports) commentaries (broadcast) | 10,130 | 10,130 | 6,002 |
j. Newsreports/reportages (broadcast) | 7,679 | 7,679 | 6,054 |
k. News (broadcast) | 7,305 | 7,305 | 6,248 |
l. Commentaries/columns/reviews (broadcast) | 7,431 | 7,431 | 5,998 |
m. Ceremonious speeches/sermons | 1,893 | 1,893 | 1,124 |
n. Lectures/seminars | 8,143 | 8,143 | 3,880 |
o. Read speech | 64,848 | 44,144 | 0 |
Total | 331,837 | 337,933 | 122,256 |
Here an overview is presented of all samples that are available in the various components in this release, together with the DVDs on which the sound files can be found. For more detailed information about these samples (type of speech, duration, speakers, types of transcription and annotation, etc.), see the metadata.