BIROn - Birkbeck Institutional Research Online

    BNC! Handle with care! Spelling and tagging errors in the BNC

    Mitton, Roger and Hardcastle, David and Pedler, Jennifer (2007) BNC! Handle with care! Spelling and tagging errors in the BNC. In: Fourth Corpus Linguistics Conference, 27-30 July 2007, Birmingham, U.K.

    Abstract

    "You loose your no-claims bonus," instead of "You lose your no-claims bonus," is an example of a real-word spelling error. One way to enable a spellchecker to detect such errors is to prime it with information about likely features of the context for "loose" (verb) as compared with "lose". To this end, we extracted all the examples of "loose" used as a verb from the BNC (World edition, text). There were, apparently, 159 occurrences of "loose" (VVB or VVI). However, on inspection, well over half of these were not verbs at all (tagging errors) and over half of the rest were misspellings of "lose". Only about 15% were actual occurrences of "loose" as a verb. This prompted us to undertake a small investigation into errors in the BNC. We report on some words that occur more often as misspellings than in their own right - only one of the 63 occurrences of "ail", for example, is correct (possibly OCR errors) - and some words that are always mistagged, such as "haulier" and "glazier" (never NN), and "hanker" and "loiter" (never VV). We note in particular that, if a rare word resembles a common word (in spelling), it is more likely to appear as a misspelling of the common word than as a correct spelling of the rare word. These cases require some modification of an earlier conclusion (Damerau and Mays, 1989) on misspellings of rare words. We conclude with a discussion of the desirability, or otherwise, of correcting errors in corpora such as the BNC. The results may be of interest to people who use the BNC as training data or for teaching.

    Metadata

    Item Type: Conference or Workshop Item (Paper)
    Keyword(s) / Subject(s): BNC, British National Corpus, spelling errors, misspellings, tagging errors, spellcheckers
    School: Birkbeck Faculties and Schools > Faculty of Science > School of Computing and Mathematical Sciences
    Depositing User: John Mitton
    Date Deposited: 11 Oct 2007
    Last Modified: 09 Aug 2023 12:29
    URI: https://eprints.bbk.ac.uk/id/eprint/591
