BIROn - Birkbeck Institutional Research Online

    Learning to separate text content and style for classification

    Zhang, Dell and Lee, W.S. (2006) Learning to separate text content and style for classification. In: Ng, H.T. and Leong, M.-K. and Kan, M.-Y. and Ji, D.-H. (eds.) AIRS 2006: Information Retrieval Technology, Third Asia Information Retrieval Symposium. Lecture Notes in Computer Science 4182. Springer, pp. 79-91. ISBN 9783540457800.

    Full text not available from this repository.

    Abstract

    Many text documents naturally have two kinds of labels. For example, we may label web pages from universities according to their categories, such as “student” or “faculty”, or according to their source universities, such as “Cornell” or “Texas”. We call one kind of label the content and the other kind the style. Given a set of documents, each with both content and style labels, we seek to learn to classify a set of documents in a new style, with no content labels, into their content classes. Assuming that every document is generated using words drawn from a mixture of two multinomial component models, one content model and one style model, we propose a method named Cartesian EM that constructs content models and style models through Expectation Maximization and performs classification of the unknown content classes transductively. Our experiments on real-world datasets show the proposed method to be effective for style-independent text content classification.
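
The generative assumption in the abstract can be illustrated with a minimal sketch. This is not the authors' Cartesian EM, whose details are not given on this page; it is a simplified EM loop written under stated assumptions: a fixed mixing weight `lam` between the content and style multinomials, a uniform prior over content classes, and additive smoothing `alpha`. The function name `content_style_em`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def normalize(counts):
    # Turn each row of non-negative counts into a probability distribution.
    return counts / counts.sum(axis=1, keepdims=True)

def content_style_em(X, content, style, n_content, n_style,
                     lam=0.5, alpha=0.1, n_iter=20):
    """Illustrative EM for a content/style word mixture (assumed form).

    X       : (n_docs, vocab) word-count matrix
    content : content label per document, -1 where unknown
    style   : style label per document (always known)
    Returns : (n_docs, n_content) posterior over content classes
    """
    n_docs = X.shape[0]
    labeled = content >= 0
    S = np.eye(n_style)[style]                       # one-hot style labels
    R = np.full((n_docs, n_content), 1.0 / n_content)
    R[labeled] = np.eye(n_content)[content[labeled]]
    theta_c = normalize(R.T @ X + alpha)             # content word models
    theta_s = normalize(S.T @ X + alpha)             # style word models
    for _ in range(n_iter):
        # Per-word mixture probability for every (document, content class).
        mix = lam * theta_c[None, :, :] + (1 - lam) * theta_s[style][:, None, :]
        # E-step, part 1: posterior over content classes for unlabeled docs.
        log_lik = np.einsum("dv,dcv->dc", X, np.log(mix))
        post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        R = np.where(labeled[:, None], R, post)      # labeled docs stay fixed
        # E-step, part 2: share of each word credited to the content model.
        g = lam * theta_c[None, :, :] / mix
        # M-step: re-estimate both multinomials from expected counts.
        weighted = R[:, :, None] * X[:, None, :]
        theta_c = normalize((weighted * g).sum(axis=0) + alpha)
        theta_s = normalize(S.T @ (weighted * (1 - g)).sum(axis=1) + alpha)
    return R

# Toy data (hypothetical): words 0-1 signal content 0, words 2-3 signal
# content 1, and words 4-6 signal styles 0, 1 and 2 respectively.
X = np.array([
    [5, 5, 0, 0, 6, 0, 0],
    [0, 0, 5, 5, 6, 0, 0],
    [5, 5, 0, 0, 0, 6, 0],
    [0, 0, 5, 5, 0, 6, 0],
    [4, 6, 0, 0, 0, 0, 6],   # new style, content unknown
    [0, 0, 6, 4, 0, 0, 6],   # new style, content unknown
], dtype=float)
content = np.array([0, 1, 0, 1, -1, -1])
style = np.array([0, 0, 1, 1, 2, 2])

R = content_style_em(X, content, style, n_content=2, n_style=3)
predicted = R.argmax(axis=1)
print(predicted)
```

The documents in the new style are classified transductively: their content posteriors are estimated inside the same EM loop that fits the style model for the new style, rather than by a classifier trained beforehand.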

    Metadata

    Item Type: Book Section
    School: Birkbeck Faculties and Schools > Faculty of Science > School of Computing and Mathematical Sciences
    Depositing User: Sarah Hall
    Date Deposited: 15 Nov 2021 13:52
    Last Modified: 09 Aug 2023 12:52
    URI: https://eprints.bbk.ac.uk/id/eprint/46724

