BIROn - Birkbeck Institutional Research Online

    Finding parallel passages in cultural heritage archives

    Harris, Martyn and Levene, Mark and Zhang, Dell and Levene, D. (2018) Finding parallel passages in cultural heritage archives. ACM Journal on Computing and Cultural Heritage 11 (3), ISSN 1556-4673.

    samtla_jocch_paper.pdf - Author's Accepted Manuscript

    21385a.pdf - Published Version of Record
    Restricted to Repository staff only


    Abstract

    It is of great interest to researchers and scholars in many disciplines (particularly those working on cultural heritage projects) to study parallel passages (i.e., identical or similar pieces of text describing the same thing) in digital text archives. Although a few software tools exist for this purpose, they are restricted to a specific domain (e.g., the Bible) or a specific language (e.g., Hebrew). In this paper, we present in detail how we build a digital infrastructure that can facilitate the search and discovery of parallel passages for any domain in any language. It is at the core of our Samtla (Search And Mining Tools with Linguistic Analysis) system, designed in collaboration with historians and linguists. The system has already been used to support research on five large text corpora that span a number of different domains and languages. The key to such a domain-independent and language-independent digital infrastructure is a novel combination of a character-based n-gram language model, a space-optimised suffix tree, and generalised edit distance. A comprehensive evaluation through crowd-sourcing shows that the effectiveness of our system's search functionality is on par with human-level performance.

    Metadata

    Item Type: Article
    Additional Information: © ACM, 2018. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version is published at the link above.
    Keyword(s) / Subject(s): digital archives, information retrieval, statistical language models, suffix trees
    School: Birkbeck Schools and Departments > School of Business, Economics & Informatics > Computer Science and Information Systems
    Research Centre: Birkbeck Knowledge Lab
    Depositing User: Dr Dell Zhang
    Date Deposited: 27 Feb 2018 11:54
    Last Modified: 24 Aug 2019 17:41
    URI: http://eprints.bbk.ac.uk/id/eprint/21385
