AI Training Set - Tag de Tekst for Named Entity Recognition (NER)

The AI training set for NER was created in 2020 by approximately 150 volunteers of the crowdsourcing project "Tag de tekst" on VeleHanden.nl. Personal names, locations and time expressions were annotated in previously developed Ground Truth transcriptions (GT transcriptions) of 10,567 scans, and the annotations were checked by three experienced super users. A detailed description of the definitions used can be found in the annotation instructions of "Tag de tekst". The Dutch-language texts date from the 17th century up to and including the 19th century. They comprise notarial texts from Amsterdam, Haarlem and seven other provinces, and archives of the Dutch East India Company (Verenigde Oost-Indische Compagnie, VOC). They come from the Stadsarchief Amsterdam, the Nationaal Archief, the Noord-Hollands Archief, and seven other Regionaal Historische Centra: Tresoar, the Gelders Archief, the Groningen Archieven, the Brabants Historisch Informatie Centrum, the Zeeuws Archief, the Historisch Centrum Limburg, Het Utrechts Archief and the Collectie Overijssel. The AI training set was developed as part of the projects "De IJsberg zichtbaar maken" (zoekintranscripties.nl) and "Slimmer zoeken in archieven" (archieveninbeeld.nl).

Product details

Data format	XML
Year	2022
Project	"Tag de Tekst" on VeleHanden.nl
Realized by	Picturae, Aincient, Sioux Technologies, Islands of Meaning, the participating archives and the volunteers of VeleHanden.nl
Documentation	Explanatory notes
Citation	AI-Trainingset for NER (Version 1.0) (2022) [Data set]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a2-v2
Languages	Dutch
License	Creative Commons Attribution 4.0 International License
Version	1.0

Download details

File
AITrainingset1.0.zip

Number of files 1
Number of downloads 112
File size 12.90 MB
Date posted 18/05/2022
Last updated 10/06/2025
Version 1.0
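
Because the data is delivered as XML inside a single zip archive, a first look at the download can be taken without unpacking it. The sketch below is a minimal illustration, not part of the official documentation: the function name `summarize_xml_archive` is hypothetical, and nothing is assumed about the XML schema. It only counts the XML files in the archive and tallies which element names occur, which is one way to discover how the annotations are encoded before consulting the explanatory notes.

```python
import os
import xml.etree.ElementTree as ET
import zipfile
from collections import Counter


def summarize_xml_archive(archive):
    """Count element names across all .xml members of a zip archive.

    `archive` may be a file path or a binary file-like object.
    Returns (number_of_xml_files, Counter mapping element name -> count).
    Hypothetical helper; it makes no assumption about the XML schema.
    """
    tag_counts = Counter()
    xml_files = 0
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip directories and non-XML members
            xml_files += 1
            with zf.open(name) as fh:
                # iterparse reads the member incrementally
                for _event, elem in ET.iterparse(fh, events=("end",)):
                    # Drop a namespace prefix of the form "{uri}name" if present
                    tag_counts[elem.tag.rsplit("}", 1)[-1]] += 1
                    elem.clear()  # release element content already counted
    return xml_files, tag_counts


if __name__ == "__main__":
    # File name taken from the download details; the path is an assumption.
    path = "AITrainingset1.0.zip"
    if os.path.exists(path):
        count, tags = summarize_xml_archive(path)
        print(f"{count} XML files")
        for tag, n in tags.most_common(20):
            print(f"{tag}\t{n}")
```

The element names reported for the real archive determine how persons, locations and time expressions are marked up; the annotation instructions and explanatory notes remain the authoritative description.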