Deduplication and Compression with Red Hat Enterprise Linux 8 Using VDO

Virtual Data Optimizer (VDO) is a block virtualization technology that provides transparent deduplication and compression of data. By eliminating redundant blocks of data, VDO can greatly reduce the disk capacity actually used. The CentOS implementation of VDO is quite good, but there are some caveats to be aware of, especially when you want filesystems on VDO to come up automatically at boot. If you do it wrong, your system will not boot! So make sure to read all the way to the end to learn how to avoid ending up in this situation!

Used Commands

  • vdo – This command is used to create, remove, start, and stop VDO volumes, and to make other configuration changes.
  • vdostats – This command is used to report on various aspects of VDO volumes, including effective reduction and physical volume utilization. Think of this as ‘df’ for VDO capacity.

To Use VDO in a Linux Server:

Install VDO:
# yum -y install vdo

Make sure you have a free block device (or create a new one) of 1GB in size
# fdisk /dev/sdX                     (where X is the letter of your new block device)

Create the empty VDO volume on top of the new device that you just created
# vdo create --name=vdo1 --device=/dev/sdX --vdoLogicalSize=3G --force

Here is a breakdown of the various options above:

  • create – This tells the vdo command which operation we want to perform. You can use “remove” to remove the VDO volume.
  • --name=vdo1 – The name we want to give to the volume.
  • --device=/dev/sdX – The underlying device on which we want to create the VDO volume.
  • --vdoLogicalSize=3G – Tells VDO that the effective capacity we want to expose to the OS is 3GB. Though the physical device is only 1GB, we are assuming at least a 3:1 reduction from deduplication. For most data this is fairly conservative, but if your data does not deduplicate well, choose a lower ratio. In general, log files and other plain-text files deduplicate very well, and you may see 10:1 or even higher. Binary files, and especially pre-compressed data such as video, audio, or compressed archives, will get far less than 3:1, and in some cases no reduction at all (1:1). Do not use VDO for this type of data.
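
The sizing reasoning above is plain multiplication: physical size times the reduction ratio you expect. Here is a minimal shell sketch of that arithmetic. The function name logical_size_gb is made up for illustration and is not part of the vdo tooling, and the ratio is your own estimate, not something VDO reports in advance.

```shell
#!/bin/sh
# Hypothetical helper (not part of vdo):
# logical size in GB = physical size in GB x expected reduction ratio.
logical_size_gb() {
    physical_gb=$1
    ratio=$2
    echo $(( physical_gb * ratio ))
}

# A 1GB back-end device with an assumed 3:1 reduction
echo "--vdoLogicalSize=$(logical_size_gb 1 3)G"
```

If you expect poorer deduplication, pass a smaller ratio; a ratio of 1 simply exposes the physical size.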

Now VDO has created a new Device Mapper device called /dev/mapper/vdo1. To investigate the New VDO Volume:
# ls -l /dev/mapper/vdo1

Check the created VDO volume size and free space.
# vdostats --hu

The --hu flag is shorthand for “--human-readable” and presents the data in a format that is a bit easier to read. From the output, we can see the Device Mapper name of the device, the size of the back-end storage device, how much data is used, how much capacity is available, and the percentage of space deduplication is saving us.

Though we haven’t written any data yet, some of the volume is already in use (4GB, or 10%, on a 40GB back-end device, for example). This is because the Universal Deduplication Service (UDS) index has already been written to disk. This is basically a database that keeps a record of block fingerprints and their locations, and it is what makes deduplication possible. Therefore, using VDO either on small back-end disks or with data that does not get at least 10% deduplication will actually be less efficient than using that storage as a regular volume.

The system thinks that our underlying disk is 3GB, even though we know it is only 1GB. Since the filesystem has no idea what the real size of the VDO back-end disk is, it is up to the system administrator to monitor physical capacity and ensure that the back-end disk does not fill up.
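
Since watching the back-end capacity is left to the administrator, a small check can help. The sketch below reads vdostats output on standard input and warns when physical usage crosses a threshold. The function name warn_if_full is hypothetical, and the script assumes the default vdostats column order (Device, 1K-blocks, Used, Available, Use%, Space saving%); verify that against your own output before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: reads `vdostats` output on stdin and prints a warning
# for each volume whose Use% is at or above the given threshold.
# Assumed column order: Device 1K-blocks Used Available Use% Space-saving%
warn_if_full() {
    threshold=$1
    awk -v limit="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the trailing percent sign
        if (use + 0 >= limit) print "WARNING: " $1 " is " use "% full"
    }'
}

# On a real system (as root): vdostats | warn_if_full 80
```

Run from cron or a systemd timer, a check like this gives some warning before the physical device fills up while the filesystem still reports free space.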

Use the VDO Volume as a Normal Disk Device

Now that we have our VDO device created, we can partition it and put a filesystem on the partition, or even use it for an LVM volume. To format and mount the volume:
# mkfs.xfs  /dev/mapper/vdo1
# mkdir /vdo
# mount /dev/mapper/vdo1 /vdo

Inspect the VDO volume size before and after writing 500MB of data to it
# vdostats --hu
# dd if=/dev/zero of=/vdo/newfile bs=1024 count=512000
# vdostats --hu
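
To see where the 500MB figure comes from, multiply the dd block size by the block count. A minimal sketch of that arithmetic follows; the function name dd_size_mib is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: total size in MiB that dd writes for a given
# block size (bytes) and block count.
dd_size_mib() {
    bs=$1
    count=$2
    echo $(( bs * count / 1048576 ))
}

# bs=1024 and count=512000, as in the dd command above
echo "$(dd_size_mib 1024 512000) MiB"
```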

See the disk utilization using the df command
# df -h

Not surprisingly, the ‘df’ output shows the filesystem on /vdo with roughly 500MB more space in use, measured against the 3GB logical size rather than the 1GB physical device.

When we delete files from the filesystem, only the filesystem’s references to the blocks are removed; VDO is not told that the blocks are free, so they still occupy the back-end storage. Clearly this is not ideal. To reclaim the capacity that has been orphaned by deleting the files, we use the fstrim command.
# fstrim /vdo

And now we see that our capacity on the VDO back-end volume has been reclaimed.
# vdostats --hu