SoNaR corpus - INT Taalmaterialen

The SoNaR corpus is a text corpus consisting of two parts: SoNaR-500 and SoNaR-1.

SoNaR-500 contains more than 500 million words of text from a variety of domains and genres. All texts were tokenised, POS-tagged and lemmatised. Named entities were also labelled. All annotations of SoNaR-500 were produced automatically.

SoNaR-1 is largely a subset of SoNaR-500 and contains 1 million words. SoNaR-1 was provided with several types of semantic annotations, namely named entity labelling, coreference annotation and the annotation of spatial and temporal relations. All annotations of SoNaR-1 were manually verified.

The new media texts (tweets, chats and text messages), which were also collected as part of the STEVIN project SoNaR, are not part of SoNaR corpus 1.0 and are available separately as the SoNaR New Media Corpus.

The SoNaR corpus can also be queried online. See OpenSoNaR.

For commercial use, see SoNaR Klein-corpus commercieel and SoNaR Groot-corpus commercieel.

Product details

Documentation	Documentation; Different SoNaR corpora
Owner	Taalunie
Funder	NTU|STEVIN
Year	2015
Commissioned by	NTU|STEVIN
Project	SoNaR
Project website	https://lands.cls.ru.nl/projects/SoNaR/description.html
Citation	SoNaR-corpus (Version 1.2.1) (2015) [Data set]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a2-h5
Languages	Dutch
Version	1.2.1

Download details

File
SoNaRCorpus_NC_1.2.1.tgz

Number of files 1
Number of downloads 1028
File size 58,823.59 MB
Date posted 04/09/2020
Last updated 23/01/2026
Version 1.2.1