#########################
# Various SoNaR Corpora #
#########################

The STEVIN SoNaR project has resulted in two datasets, viz. SoNaR-500 and SoNaR-1.

SoNaR-500 contains over 500 million words (i.e. word tokens) of full texts from a wide variety of text types, including texts from both conventional media and the new media. All texts have been tokenized, tagged for part of speech and lemmatized, and the named entities have been labelled. In the case of SoNaR-500, all annotations were produced automatically; no manual verification took place.

SoNaR-1 is a dataset comprising one million words. Although largely a subset of SoNaR-500, SoNaR-1 includes far fewer text types. SoNaR-1 comes with several types of semantic annotation, viz. named entity labelling, annotation of co-reference relations, semantic role labelling and annotation of spatial and temporal relations. All annotations have been manually verified.

1. For non-commercial use/organisations, the following SoNaR products are available:
------------------------------------------------------------------------------------

1.1 The SoNaR Corpus NC, which contains the entire dataset SoNaR-1 and the dataset SoNaR-500, except for the text categories tweets, chats and sms.

1.2 The SoNaR New Media Corpus, which contains the text categories tweets, chats and sms (WR-P-E-L_tweets, WR-U-E-A_chats and WR-U-E-D_sms).

2. For commercial use/organisations the following products are available (due to IPR reasons, some texts cannot be used commercially):
--------------------------------------------------------------------------------------------------------------------------------------

2.1 The SoNaR Large Corpus Commercial, which contains the commercial version of the dataset SoNaR-500.
Due to IPR reasons, the text categories WR-P-E-L_tweets, WR-U-E-A_chats, WR-U-E-D_sms and WS-U-E-E_written_assignments were removed from the SoNaR-500 dataset. Also due to IPR reasons, a number of text categories contain less data: WR-P-P-B_books (22 files removed), WR-P-P-G_newspapers (552,448 files removed) and WR-P-P-H_periodicals_magazines (147,480 files removed). This corpus contains approx. 271 million words.

2.2 The SoNaR Small Corpus Commercial, which contains the commercial version of the dataset SoNaR-1. Due to IPR reasons, some texts had to be removed from SoNaR-1 (382 of a total of 861 texts, or approx. 176,000 of a total of 1 million words). This corpus contains approx. 825,000 words.
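
The distribution of SoNaR-500 text categories over the products described above can be summarised as a small lookup. The following Python sketch is illustrative only: the names NEW_MEDIA, REMOVED_FROM_COMMERCIAL, REDUCED_IN_COMMERCIAL and sonar500_products are hypothetical and not part of any SoNaR tooling, and the file counts assume that the dots in the figures above are thousands separators. The function does not check whether a name is an actual SoNaR-500 category.

```python
# Illustrative sketch; all names here are hypothetical, not SoNaR tooling.

# Categories distributed only in the SoNaR New Media Corpus (non-commercial).
NEW_MEDIA = {"WR-P-E-L_tweets", "WR-U-E-A_chats", "WR-U-E-D_sms"}

# Categories removed entirely from the SoNaR Large Corpus Commercial.
REMOVED_FROM_COMMERCIAL = NEW_MEDIA | {"WS-U-E-E_written_assignments"}

# Categories kept in the commercial version with fewer files
# (value = number of files removed, assuming "." is a thousands separator).
REDUCED_IN_COMMERCIAL = {
    "WR-P-P-B_books": 22,
    "WR-P-P-G_newspapers": 552_448,
    "WR-P-P-H_periodicals_magazines": 147_480,
}


def sonar500_products(category):
    """Return the SoNaR-500-based products that include the given category.

    Assumes `category` is a valid SoNaR-500 category name; no validation.
    """
    products = []
    if category in NEW_MEDIA:
        products.append("SoNaR New Media Corpus")
    else:
        products.append("SoNaR Corpus NC")
    if category not in REMOVED_FROM_COMMERCIAL:
        products.append("SoNaR Large Corpus Commercial")
    return products
```

For example, sonar500_products("WS-U-E-E_written_assignments") lists only the SoNaR Corpus NC, since that category is excluded from the commercial version but not from the non-commercial one.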