Frequency lists
On the basis of the data in the corpus, a number of frequency lists were
derived that provide information on the frequency of occurrence of word
forms, POS tags and lemmas, and of combinations of these. There is also a
frequency list of all word forms and their phonetic transcriptions, which
is based on the data for which a manually verified phonetic transcription
is available. The frequency lists can be found in the directory
/data/lexicon/freqlists/
of this DVD; all files can be identified by the extension .frq. Special
codes may be attached to the word forms. The word form and the code are
separated by a forward slash (e.g. wonderful/foreign). The following codes are used:
- 'dialect' for dialect words;
- 'foreign' for foreign words;
- 'incomplete' for incomplete words;
- 'mispr' for words that were mispronounced;
- 'regionalpr' for words that are pronounced with a strong local/regional accent;
- 'uncertain' for words that are difficult to hear.
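As an illustration of the word form/code convention, the following Python sketch splits an entry such as wonderful/foreign into its two parts. The function name split_word_form is hypothetical, and the sketch assumes that at most one code is attached to a word form and that a slash followed by anything other than one of the six codes above belongs to the word form itself; neither assumption is confirmed by this documentation.

```python
# The six codes listed in the documentation above.
KNOWN_CODES = {"dialect", "foreign", "incomplete", "mispr", "regionalpr", "uncertain"}


def split_word_form(entry):
    """Split 'wonderful/foreign' into ('wonderful', 'foreign').

    Returns (entry, None) when no known code is attached.
    Assumption: at most one code follows the last forward slash.
    """
    base, sep, code = entry.rpartition("/")
    if sep and base and code in KNOWN_CODES:
        return base, code
    return entry, None


if __name__ == "__main__":
    print(split_word_form("wonderful/foreign"))
    print(split_word_form("wonderful"))
```

Splitting on the last slash rather than the first keeps any slash inside the word form intact, should one occur.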
The following types of frequency list
are distinguished:
- totalph
an alphabetical word frequency list in
which the frequency of occurrence is listed of all the word forms in
all the data in this release; the columns list the following
information:
- the rank number of the word
form;
- the total frequency of the
word form in the entire corpus;
- the word form.
- totrank
a word frequency list presented as a
rank order list, again based on all the data in the corpus; the columns
list the following information:
- the rank number of the word
form, the highest ranking item occurring at the top of the list;
- the total frequency of the
word form in the entire corpus;
- the word form.
- areaalph
an alphabetical word frequency list in
which a distinction is made between data originating from Flanders and
data originating from the Netherlands; the columns list the following
information:
- the rank number of the word
form;
- the total frequency of the
word form in the Dutch data;
- the total frequency of the
word form in the Flemish data;
- the total frequency of the
word form in the entire corpus;
- the word form.
- arearank
a word frequency list presented as a
rank order list in which a distinction is made between data originating
from Flanders and data originating from the Netherlands; the columns
list the following information:
- the rank number of the word
form, the highest ranking item occurring at the top of the list;
- the total frequency of the
word form in the Dutch data;
- the total frequency of the
word form in the Flemish data;
- the total frequency of the
word form in the entire corpus;
- the word form.
- typealph
an alphabetical word frequency list in
which the 15 components (speech types) in the corpus are distinguished;
the columns list the following information:
- the rank order of the word
form;
- the total frequency of the
word form in components a-o;
- (...)
- the total frequency of the word form in the entire corpus;
- the word form.
an alphabetical frequency list of the
words in the Dutch part of the data. In the list a distinction is made
between the different components (speech types) in the corpus;
the columns list the following information:
- the rank order of the word
form in the Dutch part of the data;
- the total frequency of the
word form in components a-o;
- (...)
- the total frequency of the word form in the Dutch part of the corpus;
- the word form.
- vltypealph
an alphabetical frequency list of
the words in the Flemish part of the
data. In the list a distinction is made between the different
components (speech types) in the corpus;
the columns list the following information:
- the rank order of the word
form in the Flemish part of the data;
- the total frequency of the
word form in components a-o;
- (...)
- the total frequency of the word form in the Flemish part of the data;
- the word form.
- typerank
a word frequency list presented as a
rank order list in which the 15 components (speech types) in the corpus
are distinguished; the columns list the following information:
- the rank order of the word
form, the highest ranking item occurring at the top of the list;
- the total frequency of the
word form per component (components a-o);
- (...)
- the total frequency
of the word form in the entire corpus;
- the word form.
- nltyperank
a frequency list of the words in the Dutch part of the data. The list is
presented as a rank order list and a distinction is made between the
different components (speech types) in the corpus;
the columns list the following information:
- the rank order of the word
form in the Dutch part of the data, the highest ranking item occurring
at the top of the list;
- the total frequency of the
word form per component (components a-o);
- (...)
- the total frequency
of the word form in the Dutch part of the corpus;
- the word form.
- vltyperank
a frequency list of the words in the Flemish part of the data. The list is
presented as a rank order list and a distinction is made between the
different components (speech types) in the corpus; the columns list the
following information:
- the rank order of the word
form in the Flemish part of the data, the highest ranking item
occurring at the top of the list;
- the total frequency of the
word form per component (components a-o);
- (...)
- the total frequency of the word form in the Flemish part of the data;
- the word form.
- tagalph
an alphabetical frequency list of all
POS tags; this list is structured as follows:
- [part-of-speech frequency] [part-of-speech]
- [tag frequency per part-of-speech] [POS tag]
- lemalph
a frequency list of lemmas with the
associated word forms and POS tags; this list is structured as follows:
- [NL-freq. lemma] [VL-freq. lemma] [tot. freq. lemma] [lemma]
- [NL-freq. word form-tag] [VL-freq. word form-tag] [tot. freq. word form-tag] [tag] [word form]
- fonalph
a frequency list of word forms and
their phonetic transcriptions; this list is structured as follows:
- [NL-freq. word form] [VL-freq. word form] [tot. freq. word form] [word form]
- [NL-freq. pron.] [VL-freq. pron.] [tot. freq. pron.] [pron.]
Please note that this list is based
exclusively on the data in the corpus for which a manually verified
broad phonetic transcription is available.
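
The word form lists described above (the total, area and type lists, in both alphabetical and rank order) share one column pattern: a rank first, the total frequency second to last, the word form last, and any regional or per-component frequencies in between. The Python sketch below reads one line on that basis. It assumes that the columns are separated by whitespace and that word forms contain no whitespace, which this documentation does not state; the names FreqEntry and parse_freq_line are hypothetical, and the sample lines are invented for illustration. The tag, lemma and phonetic lists have a different layout and are not covered.

```python
from typing import List, NamedTuple


class FreqEntry(NamedTuple):
    rank: int
    partial: List[int]  # regional or per-component counts; empty for the total lists
    total: int
    word_form: str


def parse_freq_line(line: str) -> FreqEntry:
    """Parse one line of a word form frequency list.

    Assumption: columns are whitespace-separated and the word form
    (including any attached code such as '/foreign') has no whitespace.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns, got {len(fields)}: {line!r}")
    rank = int(fields[0])
    total = int(fields[-2])
    word_form = fields[-1]
    partial = [int(value) for value in fields[1:-2]]
    return FreqEntry(rank, partial, total, word_form)


if __name__ == "__main__":
    # Invented sample lines, not taken from the corpus.
    print(parse_freq_line("7 42 ja"))          # total list layout
    print(parse_freq_line("1 120 80 200 de"))  # area list layout
```

For an area list, partial holds the Dutch and the Flemish frequency in that order; for a type list it holds the frequencies for the individual components.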