LeTTuce-PoS Dataset (Download) - INT Taalmaterialen

The LeTTuce-PoS dataset is a multilingual benchmark corpus for part-of-speech (PoS) tagging across different genres and domains: social media, industry reviews (FMCG, human resources, hotels, airlines), and technical and historical texts. The data were manually annotated by three trained linguists and are intended as a benchmark for evaluating and comparing PoS tagging performance.
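
Since the corpus is meant for comparing tagger output against the manual annotations, the following is a minimal sketch of how token-level accuracy could be computed overall and per group (for example per language or domain). It assumes the gold and predicted tags have already been read into aligned Python lists; the function names, the tag labels, and the group keys are illustrative assumptions, not part of the dataset or its documentation.

```python
from collections import defaultdict


def tagging_accuracy(gold, predicted):
    """Share of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def accuracy_by_group(records):
    """Accuracy per group from (group, gold_tag, predicted_tag) triples.

    The group key is hypothetical: it could be a language code or a domain label.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, gold_tag, predicted_tag in records:
        totals[group] += 1
        hits[group] += gold_tag == predicted_tag
    return {group: hits[group] / totals[group] for group in totals}


# Illustrative tags only; the corpus tagset is not specified here.
gold_tags = ["DET", "NOUN", "VERB", "ADJ"]
predicted_tags = ["DET", "NOUN", "VERB", "NOUN"]
overall = tagging_accuracy(gold_tags, predicted_tags)

per_group = accuracy_by_group([
    ("nl", "NOUN", "NOUN"),
    ("nl", "VERB", "NOUN"),
    ("en", "DET", "DET"),
])
```

Reporting accuracy per group as well as overall keeps a large subcorpus from masking weaker results on a smaller one, which matters here because the languages differ considerably in size.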

Product details

Version	1.0
Year	2026
Owner	LT3-Language and Translation Technology Team - Universiteit Gent
Funder	FWO
Project	CLARIAH-VL: Open Humanities Service Infrastructure
Paper	Van Hee, C., Doumen, J., Prins, V., Singh, P., Vandeghinste, V., & Lefever, E. (2026). TextLens & LeTTuce: Automated Corpus Annotation and Multilingual Tagging as a Service. In Proceedings of the 2026 International Conference on Computational Linguistics, Language Resources and Evaluation (LREC 2026). ELRA.
Number of tokens	109,082 (Dutch: 39,630; English: 33,131; French: 27,623; German: 8,698)
Languages	Dutch, English, German, and French
Cite as	Van Hee, C., & Lefever, E. (2026). LeTTuce-PoS Dataset (Version 1.0). Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a3-d2
License	Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International

Download details

File
LeTTuce-PoS_1.0.zip

Number of files 1
Number of downloads 2052
File size 96,432.56 MB
Date posted 16/10/2025
Last updated 30/03/2026
Version 2.0.3