Corpus Gesproken Nederlands (CGN)

The Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus) is a collection of about 900 hours (almost 9 million words) of contemporary standard Dutch speech from speakers in Flanders and the Netherlands. The speech recordings (spontaneous and prepared) are aligned with several transcriptions (including orthographic and phonetic) and annotations (syntactic, POS tags). Metadata, lexicons and frequency lists are also part of the CGN.

The corpus comes with Corex, the corpus exploration software, but please note that this software is outdated and is no longer updated or supported.

The CGN annotations are also available separately from the full corpus. These annotations are identical to those in the full Corpus Gesproken Nederlands, but come without the sound files.

For commercial use, see the commercial product page.

This product is free, but signing a license agreement is required. The download contains the license and further instructions for placing an order.

Product details

Hours of speech	900
Data format	Speech files (wav), annotations (xml and txt)
Documentation	About the Corpus Gesproken Nederlands (pdf), Searches and codes in the CGN (pdf), and the interactive documentation (references to the data are not active).
Owner	Taalunie
Funded by	Flemish and Dutch governments and NWO
Year	2014
Commissioned by	NWO/NTU
Project	Corpus Gesproken Nederlands
Citation	Corpus Gesproken Nederlands - CGN (Version 2.0.3) (2014) [Data set]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a2-k6
Software	Corex
Languages	Dutch, Flemish
Application	Research, testing of speech recognizers
Web course	CGN web course
Version	2.0.3

Download details

File
BP_CGN_NC.zip

Number of files 1
Number of downloads 2218
File size 96,432.56 MB
Date posted 03/09/2020
Last updated 30/04/2026
Version 2.0.3
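
Because the download is a single archive of roughly 96 GB, it can be useful to check what it contains before extracting it. The following is a minimal sketch, not part of the CGN distribution: the function name `count_by_extension` is hypothetical, and the internal layout of BP_CGN_NC.zip is not described on this page, so the sketch only counts archive members per file extension (for example wav, xml and txt, the formats listed under Data format). If the archive contains nested archives, those are counted as single members.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_by_extension(archive):
    """Count the members of a zip archive per file extension, without extracting.

    `archive` is a path or a binary file-like object. This helper is
    hypothetical and makes no assumption about the folder layout inside
    the archive.
    """
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            # Normalize case so ".WAV" and ".wav" are counted together.
            suffix = PurePosixPath(info.filename).suffix.lower() or "(none)"
            counts[suffix] += 1
    return counts
```

Calling `count_by_extension("BP_CGN_NC.zip")` on the downloaded file returns a `Counter` that maps each extension to the number of members carrying it. Only the archive's central directory is read, so no audio data is unpacked.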