What is data deduplication, and how is it implemented?

On Aug 20, 2018

Deduplication is arguably the biggest advancement in backup technology in the last two decades. It is single-handedly responsible for enabling the shift from tape to disk for the bulk of backup data, and its popularity only increases with each passing day. Understanding the different kinds of deduplication, also known as dedupe, is important for any person looking at backup technology.

What is data deduplication?

Dedupe is the identification and elimination of duplicate blocks within a dataset. It is similar to compression, but compression only identifies redundant blocks within a single file. Deduplication can find redundant blocks of data between files from different directories, different data types, even different servers in different locations.

For example, a dedupe system might be able to identify the unique blocks in a spreadsheet and back them up. If you update it and back it up again, it should be able to identify the blocks that have changed and back up only those. Then if you email it to a colleague, it should be able to identify the same blocks in your Sent Mail folder, their Inbox and even on their laptop’s hard drive if they save it locally. It will not need to back up these additional copies of the same blocks; it will only identify their location.

How does deduplication work?

Dedupe usually works by chopping the data to be deduped into what most call chunks. A chunk is one or more contiguous blocks of data. Where and how the chunks are divided is the subject of many patents, but suffice it to say that each product creates a series of chunks that will then be compared against all previous chunks seen by a given dedupe system.
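As a minimal sketch of the chunking step, the following Python splits a byte stream into fixed-size chunks. The function name `fixed_size_chunks` and the 4 KB chunk size are illustrative assumptions, not any product's method; commercial products use their own, often variable-size, boundary rules.

```python
from typing import Iterator


def fixed_size_chunks(data: bytes, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield consecutive chunk_size-byte slices of data; the last may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


# 10,000 bytes split into 4 KB chunks: two full chunks and a remainder.
chunks = list(fixed_size_chunks(b"A" * 10000, chunk_size=4096))
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```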

The comparison works by running each chunk through a deterministic cryptographic hashing algorithm, such as SHA-1 or SHA-256 (a member of the SHA-2 family), which creates what is called a hash. For example, if you enter “The quick brown fox jumps over the lazy dog” into a SHA-1 hash calculator, you get the following hash value:
2FD4E1C67A2D28FCED849EE1BB76E7391B93EB12

(You can try this yourself here: https://passwordsgenerator.net/sha1-hash-generator/)
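The same value can also be reproduced locally. This short sketch uses Python's standard `hashlib` module; the variable names are illustrative. It also hashes a slightly altered sentence to show that a one-character change yields a different hash.

```python
import hashlib

text = "The quick brown fox jumps over the lazy dog"

# SHA-1 of the sentence, shown as uppercase hexadecimal.
digest = hashlib.sha1(text.encode("utf-8")).hexdigest().upper()
print(digest)  # 2FD4E1C67A2D28FCED849EE1BB76E7391B93EB12

# Appending a single period produces a completely different hash.
changed = hashlib.sha1((text + ".").encode("utf-8")).hexdigest().upper()
print(changed != digest)  # True
```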

If the hashes of two chunks match, they are considered identical, because even the smallest change causes the hash of a chunk to change. A SHA-1 hash is 160 bits (20 bytes). If you create a 160-bit hash for an 8 MB chunk, you save almost 8 MB every time you back up that same chunk, because only the hash needs to be recorded again. This is why dedupe is such a space saver.
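To make the space saving concrete, here is a toy in-memory chunk store keyed by SHA-1 hash. The class `DedupeStore` and its counters are hypothetical names used only for illustration; a real dedupe system keeps its index and chunks on disk and is considerably more involved.

```python
import hashlib


class DedupeStore:
    """Toy in-memory chunk store keyed by SHA-1 hash (illustrative only)."""

    def __init__(self):
        self.chunks = {}          # hash -> chunk data, each chunk stored once
        self.bytes_received = 0   # total bytes handed to write()
        self.bytes_stored = 0     # bytes actually kept

    def write(self, chunk: bytes) -> str:
        """Store the chunk only if its hash is new; return the hash as a reference."""
        key = hashlib.sha1(chunk).hexdigest()
        self.bytes_received += len(chunk)
        if key not in self.chunks:
            self.chunks[key] = chunk
            self.bytes_stored += len(chunk)
        return key

    def read(self, key: str) -> bytes:
        """Return the chunk data for a previously returned hash."""
        return self.chunks[key]


store = DedupeStore()
chunk = b"x" * (8 * 1024 * 1024)   # one 8 MB chunk
first = store.write(chunk)
second = store.write(chunk)        # the same chunk backed up again
print(first == second)                            # True
print(store.bytes_received, store.bytes_stored)   # 16777216 8388608
```

The second write costs only the 20-byte hash reference, not another 8 MB of stored data.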

Target dedupe

Target dedupe is the most common type of dedupe sold on the market today. The idea is that you buy a target dedupe disk appliance and send your backups to its network share or to virtual tape drives if the product is a virtual tape library (VTL). The chunking and comparison steps are all done on the target; none of it is done on the source. This allows you to get the benefits of dedupe without changing your backup software.

This incremental approach allowed many companies to switch from tape to disk as their primary backup target. Most customers copied the backups to tape for offsite storage. Some advanced customers with larger budgets used the replication abilities of these target dedupe appliances to replicate their backups offsite. A good dedupe system would reduce the size of a typical full backup by 99%, and the size of an incremental backup by 90%, making replication of all backups possible. (Within reason, of course. Not everyone has enough bandwidth to handle this level of replication.)

Source dedupe

Source dedupe happens on the backup client – at the source – hence the name source, or client-side dedupe. The chunking process happens on the client, and then it passes the hash value to the backup server for the lookup process. If the backup server says a given chunk is unique, the chunk will be transferred to the backup server and written to disk. If the backup server says a given chunk has been seen before, it doesn’t even need to be transferred. This saves bandwidth and storage space.
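The exchange described above can be sketched as follows. `BackupServer`, `missing`, `store` and `client_backup` are hypothetical names, the network is simulated with plain method calls, and SHA-256 is an arbitrary choice of hash; this is an illustration of the idea under those assumptions, not any vendor's protocol.

```python
import hashlib


class BackupServer:
    """Hypothetical backup server holding an index of chunk hashes it has seen."""

    def __init__(self):
        self.index = {}  # hash -> chunk data

    def missing(self, hashes):
        """Lookup step: return the set of hashes the server has not seen before."""
        return {h for h in hashes if h not in self.index}

    def store(self, chunk_hash, chunk):
        """Write a newly transferred chunk under its hash."""
        self.index[chunk_hash] = chunk


def client_backup(server, chunks):
    """Hash chunks on the client and send only those the server lacks.

    Returns the number of chunk bytes actually transferred.
    """
    hashed = [(hashlib.sha256(c).hexdigest(), c) for c in chunks]
    needed = server.missing(h for h, _ in hashed)
    sent = 0
    for h, c in hashed:
        if h in needed:
            server.store(h, c)
            needed.discard(h)  # do not resend a chunk repeated within this backup
            sent += len(c)
    return sent


server = BackupServer()
monday = [b"alpha" * 100, b"beta" * 100, b"gamma" * 100]
tuesday = [b"alpha" * 100, b"beta" * 100, b"delta" * 100]
print(client_backup(server, monday))   # 1400 (every chunk is new)
print(client_backup(server, tuesday))  # 500 (only the changed chunk is sent)
```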

One criticism of source dedupe is that the process of creating the hash is a resource-intensive operation requiring a lot of CPU power. While this is true, it is generally offset by a significant reduction in the amount of CPU necessary to transfer the backup, since more than 90% of all chunks will be duplicates on any given backup.

The bandwidth savings also allow source dedupe to run where target dedupe cannot. For example, it allows companies to back up their laptops or mobile devices, all of which connect over the Internet. Backing up such devices with a target dedupe system would require an appliance local to each device being backed up. This is why source dedupe is the preferred method for remote backup.

There aren’t as many installations of source dedupe in the field as there are of target dedupe, for several reasons. One reason is that target dedupe products have been out and stable longer than most source dedupe products. But perhaps the biggest reason is that target dedupe can be implemented incrementally (i.e., using the same backup software and just changing the target), whereas source dedupe usually requires a wholesale replacement of your backup system. Finally, not all source-dedupe implementations are created equal, and some had a bit of a rocky road along the way.

Advantages, disadvantages of deduplication

The main advantage of target dedupe is that you can use it with virtually any backup software, as long as it is one the appliance supports. The downside is that you need an appliance everywhere you’re going to back up, even if it’s just a virtual appliance. The main advantage of source dedupe is the opposite: you can back up from literally anywhere. This flexibility can create situations where backups meet your needs but restore speeds don’t, so make sure to take that into consideration.
