Ammons, Jaybe Mark (2024) Scaling data capacity and throughput in encrypted deduplication with segment chunks and index locality. PhD thesis, Birkbeck, University of London.
Abstract
Encrypted deduplication backup systems are crucial in today's data-driven world, yet they face challenges such as excessive metadata storage in long-term backups and deduplication indexes that exceed available server memory. Heavy backup workloads can also suffer reduced throughput from resource contention when multiple client backup streams are deduplicated concurrently. This study introduces the SCAIL suite of algorithms - SCAIL, R-SCAIL, and their multiprocessor adaptations, P-SCAIL and PR-SCAIL - to address these issues. These algorithms achieve significant reductions in metadata storage and memory usage while also mitigating resource contention, enabling a substantial scale-up of both data volume and concurrent client capacity, well beyond the limits of conventional encrypted deduplication methods. Moreover, the SCAIL algorithms uniquely combine the throughput advantages of coarse-grained, segment-based deduplication with the high compression of fine-grained, chunk-based deduplication. The SCAIL suite adapts Metadedup's metadata deduplication for client-side deduplication, leveraging a memory-based index of fingerprints derived from the metadata of data segments. This strategy enables rapid elimination of duplicate segments, streamlining the deduplication process while reducing both metadata uploads and the overall storage footprint. Because this level of deduplication is coarse, SCAIL may sometimes re-upload previously stored chunks. To mitigate this, R-SCAIL introduces resemblance-based, chunk-level client-side deduplication, trading some of SCAIL's speed for a reduced upload volume.
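The idea of indexing segments by a fingerprint derived from their chunk metadata can be illustrated with a minimal sketch. This is not the thesis's implementation: the chunking scheme, hash choice, and all names here (`CHUNK_SIZE`, `deduplicate_segment`, etc.) are illustrative assumptions, and real systems would use content-defined chunking and encrypted fingerprints.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed-size chunking (real systems vary)

def chunk_fingerprints(data: bytes):
    """Split a segment into chunks and fingerprint each chunk."""
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).digest()
            for i in range(0, len(data), CHUNK_SIZE)]

def segment_fingerprint(chunk_fps):
    """Derive a segment fingerprint from its chunk metadata,
    not from the raw data itself."""
    h = hashlib.sha256()
    for fp in chunk_fps:
        h.update(fp)
    return h.digest()

# In-memory index of known segment fingerprints; a small footprint,
# since it holds one entry per segment rather than per chunk.
segment_index = set()

def deduplicate_segment(data: bytes) -> bool:
    """Return True if the segment must be uploaded, False if it is
    a duplicate and the whole upload can be skipped."""
    seg_fp = segment_fingerprint(chunk_fingerprints(data))
    if seg_fp in segment_index:
        return False  # entire segment already stored
    segment_index.add(seg_fp)
    return True
```

Because one lookup covers a whole segment of chunks, duplicate data is rejected quickly and with little memory, which is the client-side trade-off the abstract describes: fast and compact, at the cost of occasionally re-uploading chunks inside a segment that is only partly new.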
For server-side deduplication, the SCAIL family adapts Sorted Deduplication's index-locality technique to perform exact, cross-client, chunk-level deduplication in a single, highly efficient sequential pass through the disk-based index. By harnessing multiprocessor server architectures, P-SCAIL and PR-SCAIL introduce data and task parallelism, significantly boosting deduplication throughput on both the client and server sides. In an extensive evaluation on real-world backup datasets, the SCAIL suite substantially reduced memory and storage requirements, enhancing the server's capacity to manage larger data volumes and more concurrent clients. P-SCAIL achieved client-side deduplication throughput in the hundreds of GiB/s, and PR-SCAIL in the tens of GiB/s; both sustain server-side deduplication at rates matching the transfer speeds of modern hard-disk drives. The resulting combination of high throughput with low memory and storage requirements makes the SCAIL family a significant advancement in the field of encrypted deduplication.
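The single sequential pass over a sorted, disk-based index amounts to a merge-join: sort the incoming batch of chunk fingerprints, then scan it against the sorted index once, never seeking backwards. The sketch below models the on-disk index as a sorted in-memory list purely for illustration; the function name and interface are assumptions, not the thesis's API.

```python
def sequential_dedup(incoming_fps, sorted_index):
    """Merge-join a batch of chunk fingerprints against a sorted index.

    Both inputs are scanned exactly once, mirroring a single sequential
    pass over a sorted on-disk index. Returns (new_fps, merged_index):
    the fingerprints not yet stored, and the updated sorted index.
    """
    incoming = sorted(set(incoming_fps))  # sort the batch up front
    new_fps, merged = [], []
    i = j = 0
    while i < len(incoming) and j < len(sorted_index):
        if incoming[i] == sorted_index[j]:
            merged.append(sorted_index[j])      # duplicate: already stored
            i += 1
            j += 1
        elif incoming[i] < sorted_index[j]:
            new_fps.append(incoming[i])         # genuinely new chunk
            merged.append(incoming[i])
            i += 1
        else:
            merged.append(sorted_index[j])      # index entry, keep it
            j += 1
    new_fps.extend(incoming[i:])                # remaining batch is new
    merged.extend(incoming[i:])
    merged.extend(sorted_index[j:])             # remaining index entries
    return new_fps, merged
```

Reading the index strictly in order is what makes this disk-friendly: a hard-disk drive streams a sorted index file at its full sequential transfer rate, avoiding the random lookups that throttle a conventional on-disk fingerprint index.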
Metadata
Item Type: Thesis
Copyright Holders: The copyright of this thesis rests with the author, who asserts his/her right to be known as such according to the Copyright, Designs and Patents Act 1988. No dealing with the thesis contrary to the copyright or moral rights of the author is permitted.
Depositing User: Acquisitions And Metadata
Date Deposited: 20 Dec 2024 12:27
Last Modified: 20 Dec 2024 12:27
URI: https://eprints.bbk.ac.uk/id/eprint/54728
DOI: https://doi.org/10.18743/PUB.00054728