BIROn - Birkbeck Institutional Research Online

BNC! Handle with care! Spelling and tagging errors in the BNC

Mitton, Roger and Hardcastle, David and Pedler, Jennifer (2007) BNC! Handle with care! Spelling and tagging errors in the BNC. In: Fourth Corpus Linguistics Conference, 27-30 July 2007, Birmingham, U.K.

Abstract

"You loose your no-claims bonus," instead of "You lose your no-claims bonus," is an example of a real-word spelling error. One way to enable a spellchecker to detect such errors is to prime it with information about likely features of the context for "loose" (verb) as compared with "lose". To this end, we extracted all the examples of "loose" used as a verb from the BNC (World edition, text). There were, apparently, 159 occurrences of "loose" (VVB or VVI). However, on inspection, well over half of these were not verbs at all (tagging errors) and over half of the rest were misspellings of "lose". Only about 15% were actual occurrences of "loose" as a verb. This prompted us to undertake a small investigation into errors in the BNC. We report on some words that occur more often as misspellings than in their own right - only one of the 63 occurrences of "ail", for example, is correct (the others are possibly OCR errors) - and some words that are always mistagged, such as "haulier" and "glazier" (never NN), and "hanker" and "loiter" (never VV). We note in particular that, if a rare word resembles a common word in spelling, it is more likely to appear as a misspelling of the common word than as a correct spelling of the rare word. These cases require some modification of an earlier conclusion (Damerau and Mays, 1989) on misspellings of rare words. We conclude with a discussion of the desirability, or otherwise, of correcting errors in corpora such as the BNC. The results may be of interest to people who use the BNC as training data or for teaching.
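
The extraction step described in the abstract (collecting every token of "loose" tagged VVB or VVI) might look roughly like the sketch below. It is an illustration only, not the authors' procedure: it assumes XML-style markup in which each word is a `<w>` element carrying its C5 tag in a `c5` attribute, whereas the World edition used in the paper is distributed in a different markup. The sample sentences and the helper name `find_tagged` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample; the <s>/<w> layout and the c5 attribute are assumptions
# about the markup, not a quotation from the corpus.
SAMPLE = """<text>
<s n="1"><w c5="PNP">You </w><w c5="VVB">loose </w><w c5="DPS">your </w><w c5="NN1">bonus</w></s>
<s n="2"><w c5="AT0">a </w><w c5="AJ0">loose </w><w c5="NN1">thread</w></s>
<s n="3"><w c5="TO0">to </w><w c5="VVI">loose </w><w c5="AT0">the </w><w c5="NN2">dogs</w></s>
</text>"""


def find_tagged(xml_text, word, tags):
    """Return (sentence number, tag, sentence text) for each token of
    `word` whose C5 tag is in `tags`."""
    root = ET.fromstring(xml_text)
    hits = []
    for s in root.iter("s"):
        # Rebuild the sentence so each hit can be inspected in context.
        sentence = "".join(w.text or "" for w in s.iter("w")).strip()
        for w in s.iter("w"):
            token = (w.text or "").strip().lower()
            if token == word and w.get("c5") in tags:
                hits.append((s.get("n"), w.get("c5"), sentence))
    return hits


hits = find_tagged(SAMPLE, "loose", {"VVB", "VVI"})
for n, tag, sentence in hits:
    print(n, tag, sentence)
```

A query of this kind only returns what the tagger labelled as a verb. As the abstract reports, each hit still has to be read in context to separate genuine verb uses from tagging errors and from misspellings of "lose".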

Metadata

Item Type: Conference or Workshop Item (Paper)
Keyword(s) / Subject(s): BNC, British National Corpus, spelling errors, misspellings, tagging errors, spellcheckers
School: Birkbeck Faculties and Schools > Faculty of Science > School of Computing and Mathematical Sciences
Depositing User: John Mitton
Date Deposited: 11 Oct 2007
Last Modified: 08 Apr 2025 20:47
URI: https://eprints.bbk.ac.uk/id/eprint/591
