BIROn - Birkbeck Institutional Research Online

    Applications of deep learning and statistical methods for a systems understanding of convergence in immune repertoires

    Moghimi, Pejvak Abbas Zadeh (2023) Applications of deep learning and statistical methods for a systems understanding of convergence in immune repertoires. PhD thesis, Birkbeck, University of London.

    Pejvak_Moghimi_Thesis_BBK_lib_sub.pdf - Full Version



    Deep learning and adaptive immune receptor repertoire (AIRR) biology are two emerging fields that are highly compatible: the immune system is inherently complex, AIRR-sequencing research produces enormous amounts of data, and deep learning has proved remarkably successful at making predictions about high-dimensional complex systems and data. We took steps towards the effective utilisation of deep learning and statistical methods in repertoire immunology by tackling one of the central problems in immunology, namely immune repertoire convergence. First, we took part in developing and testing an array of summary statistics for immune repertoires, both to gain insight into their descriptive features and to enable comparison between repertoires. We then collected the deepest sequencing datasets to address whether the population-wide genomic convergence of immunoglobulin molecules can be predicted. The immunoglobulin molecules were labelled with their “degree of commonality” (DoC), defined as the number of times an immunoglobulin V3J clonotype is observed in a population, where a V3J clonotype is defined by its V and J genes and CDR3 sequence. We developed various bespoke data analytics methods, informed at different stages by the summary statistics we had previously implemented. Importantly, we demonstrated that machine learning (ML) predictions for immune repertoires can yield misleadingly positive results when inappropriate data processing introduces “data leakage”, and we addressed this issue by implementing a leak-free data processing pipeline. Here, data leakage refers to immunoglobulin sequences with the same clonotype definition being spread across the train-validation-test splits in the ML task. We designed a multitude of bespoke deep neural network architectures, implemented under various modelling approaches, including a customised squeeze-and-excitation temporal convolutional neural network (SE-TCN) and a Transformer model. Unsurprisingly, given the continuous spectrum of DoCs, regression modelling proved to be the best approach in terms of both the granularity of predictions and the error distribution. Finally, we report that our SE-TCN architecture under the regression modelling framework achieves state-of-the-art performance, with an overall mean absolute error (MAE) of 0.083 and per-DoC error distributions with reasonably small standard deviations.
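The two definitions in the abstract, DoC and clonotype-level data leakage, can be illustrated with a minimal sketch. It is not the thesis pipeline. It assumes that "observed in a population" means the number of distinct individuals carrying a V3J clonotype, which is only one reading of the definition, and it assumes a simple 80/10/10 split. The function names, the record layout `(donor_id, v_gene, j_gene, cdr3)` and the example records are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy records: (donor_id, v_gene, j_gene, cdr3).
example_records = [
    ("donor1", "IGHV1-2", "IGHJ4", "CARDYW"),
    ("donor2", "IGHV1-2", "IGHJ4", "CARDYW"),
    ("donor1", "IGHV1-2", "IGHJ4", "CARDYW"),  # repeat within the same donor
    ("donor3", "IGHV3-23", "IGHJ6", "CAKGGW"),
    ("donor2", "IGHV4-34", "IGHJ5", "CARGYW"),
    ("donor3", "IGHV1-69", "IGHJ4", "CASDYW"),
]


def degree_of_commonality(records):
    """Map each V3J clonotype to the number of distinct donors carrying it.

    Assumption: DoC counts individuals, not raw sequence occurrences.
    """
    donors = defaultdict(set)
    for donor_id, v_gene, j_gene, cdr3 in records:
        donors[(v_gene, j_gene, cdr3)].add(donor_id)
    return {clonotype: len(ids) for clonotype, ids in donors.items()}


def leak_free_split(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clonotypes to splits so that none spans two splits.

    The split fractions apply to clonotypes, not to individual records.
    """
    records = list(records)
    clonotypes = sorted({tuple(r[1:]) for r in records})
    random.Random(seed).shuffle(clonotypes)

    n_train = int(fractions[0] * len(clonotypes))
    n_val = int(fractions[1] * len(clonotypes))

    assignment = {}
    for i, clonotype in enumerate(clonotypes):
        if i < n_train:
            assignment[clonotype] = "train"
        elif i < n_train + n_val:
            assignment[clonotype] = "validation"
        else:
            assignment[clonotype] = "test"

    splits = {"train": [], "validation": [], "test": []}
    for record in records:
        splits[assignment[tuple(record[1:])]].append(record)
    return splits


doc = degree_of_commonality(example_records)
splits = leak_free_split(example_records)
```

Splitting on clonotypes rather than on individual sequences is what prevents leakage in this sketch: every record sharing a V gene, J gene and CDR3 is routed to the same split, so a model cannot be evaluated on a clonotype it has already seen during training.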


    Item Type: Thesis
    Copyright Holders: The copyright of this thesis rests with the author, who asserts his/her right to be known as such according to the Copyright Designs and Patents Act 1988. No dealing with the thesis contrary to the copyright or moral rights of the author is permitted.
    Depositing User: Acquisitions And Metadata
    Date Deposited: 21 Sep 2023 10:55
    Last Modified: 01 Nov 2023 16:18

