BIROn - Birkbeck Institutional Research Online

    Bootstrap domain-specific sentiment classifiers from unlabeled corpora

    Mudinas, Andrius and Zhang, Dell and Levene, Mark (2018) Bootstrap domain-specific sentiment classifiers from unlabeled corpora. Transactions of the Association for Computational Linguistics 6, pp. 269-285. ISSN 2307-387X.

    22000A.pdf - Author's Accepted Manuscript
    Restricted to Repository staff only
    Available under License Creative Commons Attribution.

    22000.pdf - Published Version of Record
    Available under License Creative Commons Attribution.


    Abstract

    There is often the need to perform sentiment classification in a particular domain where no labeled document is available. Although we could make use of a general-purpose off-the-shelf sentiment classifier or a pre-built one for a different domain, the effectiveness would be inferior. In this paper, we explore the possibility of building domain-specific sentiment classifiers with unlabeled documents only. Our investigation indicates that in the word embeddings learned from the unlabeled corpus of a given domain, the distributed word representations (vectors) for opposite sentiments form distinct clusters, though those clusters are not transferable across domains. Exploiting such a clustering structure, we are able to utilize machine learning algorithms to induce a quality domain-specific sentiment lexicon from just a few typical sentiment words ("seeds"). An important finding is that simple linear model based supervised learning algorithms (such as linear SVM) can actually work better than more sophisticated semi-supervised/transductive learning algorithms which represent the state-of-the-art technique for sentiment lexicon induction. The induced lexicon could be applied directly in a lexicon-based method for sentiment classification, but a higher performance could be achieved through a two-phase bootstrapping method which uses the induced lexicon to assign positive/negative sentiment scores to unlabeled documents first, and then uses those documents found to have clear sentiment signals as pseudo-labeled examples to train a document sentiment classifier via supervised learning algorithms (such as LSTM). On several benchmark datasets for document sentiment classification, our end-to-end pipelined approach which is overall unsupervised (except for a tiny set of seed words) outperforms existing unsupervised approaches and achieves an accuracy comparable to that of fully supervised approaches.
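As a rough illustration of the first phase of the two-phase bootstrapping described in the abstract, the sketch below scores unlabeled documents with a sentiment lexicon and keeps only those with a clear sentiment signal as pseudo-labeled examples. It is a minimal sketch and not the authors' implementation: the lexicon entries, the whitespace tokenizer, the `threshold` value, and the function names `lexicon_score` and `pseudo_label` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical induced lexicon: word -> sentiment score.
# These values are illustrative assumptions, not taken from the paper.
LEXICON: Dict[str, float] = {
    "good": 1.0, "great": 1.5, "excellent": 2.0,
    "bad": -1.0, "poor": -1.5, "terrible": -2.0,
}


def lexicon_score(text: str, lexicon: Dict[str, float]) -> float:
    """Sum the lexicon scores of the whitespace-separated tokens in a document."""
    return sum(lexicon.get(token, 0.0) for token in text.lower().split())


def pseudo_label(
    docs: List[str], lexicon: Dict[str, float], threshold: float = 1.5
) -> List[Tuple[str, int]]:
    """Keep documents whose absolute score reaches the (assumed) threshold.

    Returns (document, label) pairs with label 1 for positive and 0 for negative.
    Documents with a weak or mixed signal are dropped.
    """
    labeled = []
    for doc in docs:
        score = lexicon_score(doc, lexicon)
        if abs(score) >= threshold:
            labeled.append((doc, 1 if score > 0 else 0))
    return labeled


docs = [
    "great food and excellent service",   # strongly positive
    "terrible battery and poor screen",   # strongly negative
    "good price but bad support",         # mixed signal, dropped
    "arrived on tuesday",                 # no sentiment words, dropped
]
pseudo_labeled = pseudo_label(docs, LEXICON)
```

In the second phase, the `pseudo_labeled` pairs would serve as training data for a supervised document classifier (the abstract mentions LSTM as one option); that step is omitted here.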

    Metadata

    Item Type: Article
    School: Birkbeck Faculties and Schools > Faculty of Science > School of Computing and Mathematical Sciences
    Research Centres and Institutes: Birkbeck Knowledge Lab; Birkbeck Institute for Data Analytics
    Depositing User: Dell Zhang
    Date Deposited: 21 May 2018 08:52
    Last Modified: 09 Aug 2023 12:43
    URI: https://eprints.bbk.ac.uk/id/eprint/22000

