Museums are increasingly presenting their collections in digital form. Many of the terms used in the mostly historical descriptions of objects and collections reflect once-dominant narratives that are now outdated and, in many cases, openly racist. Museums are therefore addressing this issue, for instance by replacing object names now regarded as inappropriate. But technological developments likewise demand a responsible approach to problematic historical terminology. This article examines how museums contextualise discriminatory language on the web – in their digital collections, for example – and discusses how they can prevent AI systems from multiplying inappropriate content, with the ultimate aim of reconciling historical description with ethically responsible media use.
The risks involved in AI training data from museum holdings
AI systems are trained on vast quantities of data from the internet, gathered by automated web crawlers. If discriminatory terms in museum data are inadequately contextualised, there is a genuine risk that AI systems will adopt them uncritically and thereby perpetuate stereotypes or discriminatory narratives. Studies confirm that modern AI systems can reproduce racist, colonialist or stereotypical content if such material is present in their training data (UNESCO and IRCAI 2024; Nesterov, Hollink, van Erp et al. 2023a; 2023b). This poses a moral challenge for cultural institutions: they are called upon to preserve knowledge of historical designations while at the same time preventing their re-use in modern contexts.
To avoid unwanted use of their data, museums could, for example, refuse such crawlers access to their texts and images altogether. Companies such as Cloudflare already offer tools that allow website operators to block all AI bots, scrapers and crawlers with a single click (Bocharov, Vargas, Martinetti et al. 2024). If unwanted crawlers nevertheless gain access to a site, the company offers solutions that confuse them and deliberately slow them down with AI-generated content (Tatoris, Saxena and Miglietti 2025).
However, such blanket blocking of crawlers would also mean that AI systems such as ChatGPT or Google Gemini – now used by large numbers of people, including in education – could not draw on museum content. A general exclusion of AI crawlers is therefore not a viable option: public cultural institutions are obliged to make their knowledge available to society. What is more, museum content can serve as a corrective to speculation and misinformation, so access by web crawlers is in fact desirable.
One possible solution lies in mechanisms that control access to specific content and thus safeguard sensitive museum data once it is published. Several instruments are available for this purpose. The robots.txt file ("The Web Robots Pages", n.d.) allows for basic control by blocking or permitting crawler access to defined sections of a website. The X-Robots-Tag ("X-Robots-Tag – HTTP | MDN", n.d.) offers more precise options, using an HTTP header to define for individual pages whether and how AI systems may record them. A third option is the more recent TDM Reservation Protocol ("TDM Reservation Protocol (TDMRep)", n.d.), which permits even more fine-grained control over the conditions for text and data mining. Nonetheless, the effectiveness of these solutions depends on voluntary compliance by the crawlers, and their implementation requires specific technical expertise. Binding enforcement of such usage conditions currently appears possible only through complete blocking – which, as noted, would work against the public educational mandate of museums and prevent the valuable knowledge they preserve from reaching modern, AI-supported information channels.
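To illustrate, the following sketch shows how these three mechanisms might be combined on a museum website. The crawler names are in common use at the time of writing, but the list is illustrative rather than exhaustive; the "noai" directive is a non-standardised proposal honoured only by some crawlers, and the policy URL is a hypothetical placeholder.

    # robots.txt: deny known AI training crawlers, allow all other robots
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: *
    Allow: /

    # X-Robots-Tag, sent as an HTTP response header (here: Apache configuration)
    # "noarchive" is standardised; "noai" is a proposal with only partial support
    <Location "/collection/">
        Header set X-Robots-Tag "noarchive, noai"
    </Location>

    # TDM Reservation Protocol, likewise sent as HTTP response headers:
    # reserve text-and-data-mining rights and point to a machine-readable policy
    Header set tdm-reservation "1"
    Header set tdm-policy "https://museum.example/tdm-policy.json"

All three layers remain requests rather than guarantees: a crawler that ignores robots.txt will typically ignore the headers as well, which is precisely the compliance problem described above.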
Contextualising problematic terms
Museums should therefore ensure that their content is clearly contextualised and deliberately communicated so that it can serve as a valuable resource against bias and misinformation – both for human users and for AI systems.
The contextualisation of historically problematic terms is therefore at the core of a responsible approach to discriminatory language in museum data. Unlike complete removal or uncritical retention, contextualisation makes it possible to preserve historical accuracy while at the same time ensuring critical assessment. Essential to this process is a multi-layered metadata structure. In the case of "title contextualisation" (Mähr and Schnegg 2024), such a structure might consist of a primary layer with contemporary, non-discriminatory terms, while historical designations are recorded on a secondary layer with clear indications of their historical context and their problematic nature. This process should begin in the museum database itself and extend through to public release. Practically speaking, this means that colonial-era object names or collection categories such as "exotic" or "primitive" should not be deleted but instead marked with attributes such as "historical designation" or "colonial-era term" and supplemented with explanatory notes. The DE-BIAS tool ("The DE-BIAS Tool", n.d.), developed specifically to detect problematic language in cultural collections, can assist with the identification of such terms.
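A minimal sketch of what such a layered record could look like, assuming a hypothetical collection-database schema (the field names are illustrative, not a standard):

    {
      "title": "Figure of a standing man",
      "alternative_titles": [
        {
          "value": "[original catalogue title containing a colonial-era term]",
          "type": "historical designation",
          "note": "Title from the 1908 acquisition catalogue, retained for documentation. The term it contains is now regarded as discriminatory."
        }
      ]
    }

The primary layer carries the contemporary term that is displayed by default; the secondary layer preserves the historical designation together with the context that flags it as problematic.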
In addition, machine-readable annotations – for example using RDFa ("RDFa", n.d.) or JSON-LD ("JSON-LD – JSON for Linking Data", n.d.) – can mark up content and its semantic structures in a clear and systematic way. Such procedures enable technical systems to record and represent multiple dimensions of the data, such as temporal contexts, content relationships and further metadata. In combination with the DE-BIAS vocabulary ("The DE-BIAS Vocabulary", n.d.; "DE-BIAS Vocabulary" 2025), this opens up the possibility of annotating problematic terms consistently and transparently. However, there are currently no binding standards for systematically designating historical or problematic terms as, for instance, "historical title" or "colonial-era description". This lack of standardisation makes automated, unambiguous and machine-readable contextualisation of problematic language more difficult. Moreover, AI crawlers fundamentally record all available information without filtering out specific terms. For this reason, it is important to ensure that problematic terms are always accompanied by the necessary contextual information – even if only in the main text, without special technical annotation – so that AI systems can process these terms in context. This multi-layered contextualisation allows museums to fulfil their historical responsibility while at the same time reducing the likelihood of AI systems reproducing discriminatory narratives uncritically.
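A sketch of such an annotation as JSON-LD embedded in an object page, using the widely supported schema.org vocabulary. Since, as noted, no binding standard exists for marking historical designations, the pairing of alternateName with an explanatory description shown here is one possible convention, not an established one:

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "VisualArtwork",
      "name": "Figure of a standing man",
      "alternateName": "[original catalogue title containing a colonial-era term]",
      "description": "The alternate name reproduces the original catalogue entry; it contains a term now regarded as discriminatory and is retained for historical documentation only."
    }
    </script>

Because the contextual note travels in the same structured record as the historical term, a crawler that ingests the one necessarily ingests the other.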
A responsible approach to historically charged and discriminatory terms in digital museum holdings is indispensable in the age of AI. Museums face the dual challenge of safeguarding historical authenticity while preventing the spread of problematic narratives through AI systems. The contextual embedding of problematic terms and transparent communication of their historical meaning are essential steps. A multi-layered metadata structure, machine-readable annotations and supporting tools such as DE-BIAS provide workable solutions to these challenges. In this way, museums can meet their social responsibility while also serving as a reliable source of information for both human users and AI systems.
Jamie Dau is Adviser for Provenance and Archives at the Reiss-Engelhorn-Museen.
Leslie Zimmermann is Adviser for AI and Digital Strategy at the Reiss-Engelhorn-Museen.
References:
Bocharov, Alex, Santiago Vargas, Adam Martinetti, Reid Tatoris, and Carlos Azevedo. 2024. Declare Your AIndependence: Block AI Bots, Scrapers and Crawlers with a Single Click. The Cloudflare Blog (blog). 3 July 2024. https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/.
Mähr, Moritz, and Noëlle Schnegg. 2024. Handbuch zur Erstellung diskriminierungsfreier Metadaten für historische Quellen und Forschungsdaten. 3 June 2024. https://doi.org/10.5281/zenodo.11124720.
Nesterov, Andrei, Laura Hollink, Marieke van Erp and Jacco van Ossenbruggen. 2023a. A Knowledge Graph of Contentious Terminology for Inclusive Representation of Cultural Heritage. In The Semantic Web, edited by Catia Pesquita, Ernesto Jimenez-Ruiz, Jamie McCusker, Daniel Faria, Mauro Dragoni, Anastasia Dimou, Raphael Troncy and Sven Hertling, 502–19. Cham: Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-33455-9_30.
Nesterov, Andrei, Laura Hollink, Marieke van Erp and Jacco van Ossenbruggen. 2023b. cultural-ai/wordsmatter: Words Matter: a knowledge graph of contentious terms. Zenodo. https://doi.org/10.5281/zenodo.7713157.
Tatoris, Reid, Harsh Saxena, and Luis Miglietti. 2025. Trapping Misbehaving Bots in an AI Labyrinth. The Cloudflare Blog (blog). 19 March 2025. https://blog.cloudflare.com/ai-labyrinth/.
UNESCO and IRCAI. 2024. Challenging Systematic Prejudices: An Investigation into Gender Bias in Large Language Models. https://unesdoc.unesco.org/ark:/48223/pf0000388971.