Eve, Martin Paul (2024) Evaluating Document Similarity Detection Approaches for Content Drift Detection. eve.gd.
2024-12-21-evaluating-document-similarity-detection-approaches.markdown (Text, 36kB). Published Version of Record, available under a Creative Commons Attribution license.
Abstract
“Content drift” is an important concept for digital preservation and web archiving. Scholarly readers expect to find immutable (“persisted”) content at the resolution endpoint of a DOI. It is a matter of research integrity that research articles should remain the same at that endpoint, as citations can refer to specific textual formulations. Detecting content drift is not an easy task: it is not simply a matter of seeing whether a web page has changed. HTML pages can change in layout while retaining exactly the same content, in a way that is obvious to humans but not to machines. Web pages must therefore be parsed into plaintext versions, itself a difficult task due to JavaScript components and CAPTCHA systems. This preprocessing must then discard dynamic user content, such as comments and annotations, which can interfere with document similarity detection. This post details the performance and resources required by a range of approaches to document similarity detection once these preprocessing steps have been completed. The algorithms evaluated range from older systems, such as Jaccard similarity, through to modern machine learning and AI models. The conclusion is that earlier algorithms provide a better measure of “content drift” but a worse semantic understanding of documents. However, we do not want a “semantic understanding” of document similarity in this use case. For a person, “This jar is empty” and “This jar has nothing in it” are the same thing, semantically. But for the purpose of working out whether a document has changed in a scholarly context, we would say it has drifted, because someone might have cited “This jar is empty” and would expect to find that formulation at the persisted endpoint. Hence, machine learning models that perceive semantic reformulations as “similar” fare worse at content drift detection than earlier algorithms that detect lexical change and containment.
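The lexical class of algorithm the abstract favours can be illustrated with a minimal sketch of Jaccard similarity over word tokens (an illustrative example, not the post's actual implementation; real pipelines would use proper tokenisation and possibly shingling):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over word tokens: |A ∩ B| / |A ∪ B|."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty documents are identical by convention
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# The two semantically equivalent sentences from the abstract share only
# two of eight distinct tokens, so they score low lexically — which is
# exactly the behaviour content drift detection wants:
print(jaccard_similarity("This jar is empty", "This jar has nothing in it"))
# → 0.25
```

An embedding model would score these two sentences as highly similar, masking the drift; the lexical score of 0.25 flags that the cited wording has changed.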
Metadata
| Field | Value |
|---|---|
| Item Type | Article |
| School | Birkbeck Faculties and Schools > Faculty of Humanities and Social Sciences > School of Creative Arts, Culture and Communication |
| Depositing User | Martin Eve |
| Date Deposited | 21 Dec 2024 11:12 |
| Last Modified | 21 Dec 2024 11:12 |
| URI | https://eprints.bbk.ac.uk/id/eprint/54734 |