The CGN lexicon

This release also includes a version of the CGN lexicon. The lexicon was used in the production of the corpus, for example for spell checking, lemmatisation of the tokens and assignment of the part of speech. Lexical information was also used in producing the (automatic) phonetic transcriptions and the syntactic annotations.

The CGN lexicon is based on existing resources such as CELEX, RBN, PAROLE, FONILEX, Van Dale, de Woordenlijst Nederlandse Taal (Groene Boekje) and the Corpus Uit den Boogaart, and has been further adapted and extended for use with the Spoken Dutch Corpus. The lexicon comprises two parts: a standard lexicon with continuous word forms and a separate lexicon for multi-words. Both lexicons include only lexical items that actually occur in the corpus. Items for which lexical information is irrelevant (e.g. hesitations and incomplete words) are excluded.

The lexicon files are available in two formats: flat ASCII (text) format and HTML for consultation by means of a web browser. The files can be found on this DVD in the directories /data/lexicon/text/ and /data/lexicon/html/. More detailed information about the standard lexicon and the multi-word lexicon can be found in README_cgnlex_V2.0_eng.htm and README_cgnmlex_V2.0_eng.htm respectively.

The lexicons contain information about the token (word form), part of speech, lemma, syntax, orthographic status, pronunciation and morphology and, for a multi-word expression, its nature (continuous/discontinuous).
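
As an illustration of how such an entry might be handled in software, the following Python sketch models one lexicon record with the categories of information listed above. It is not the actual file layout: the field names, the field order and the tab delimiter are all assumptions made for this example, and the sample values are placeholders. The real layout of the flat ASCII files is described in README_cgnlex_V2.0_eng.htm and README_cgnmlex_V2.0_eng.htm.

```python
from dataclasses import dataclass, fields


@dataclass
class LexiconEntry:
    """One lexicon record; field names are hypothetical labels for the
    information categories listed in the text above."""
    token: str                # word form
    part_of_speech: str
    lemma: str
    syntax: str
    orthographic_status: str
    pronunciation: str
    morphology: str
    multiword_nature: str     # continuous/discontinuous (multi-words only)


# ASSUMPTION: fields appear in this order on each line. Check the README
# files for the real order and pass it in via `field_order` if it differs.
DEFAULT_FIELD_ORDER = tuple(f.name for f in fields(LexiconEntry))


def parse_entry(line, delimiter="\t", field_order=DEFAULT_FIELD_ORDER):
    """Split one text line into a LexiconEntry.

    ASSUMPTION: one entry per line, fields separated by `delimiter`.
    """
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(field_order):
        raise ValueError(
            f"expected {len(field_order)} fields, got {len(values)}"
        )
    return LexiconEntry(**dict(zip(field_order, values)))


# Placeholder line built for this example, not taken from the lexicon.
sample = "\t".join(
    ["huizen", "<pos>", "huis", "<syntax>", "<status>",
     "<pronunciation>", "<morphology>", ""]
)
entry = parse_entry(sample)
print(entry.token, entry.lemma)
```

Keeping the delimiter and the field order as parameters means the same function can be pointed at the real format once it has been looked up in the README files, without changing the record type.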