As I mentioned in a previous post (When To Use SSD With SDS - Part 3), IzumoFS is equipped with an inline deduplication feature. This advanced functionality has mostly been found in high-cost backup appliances or all-flash storage; in fact, it is rare to see an SDS product with inline deduplication built in.
I think the combination of inline deduplication and distributed storage is unfamiliar to most people, but the pair actually has benefits beyond storage capacity reduction. In this article, I'll discuss why inline deduplication is implemented in IzumoFS.
Although inline deduplication has become popular these days, let me first illustrate the difference between post-processing and inline deduplication.
Post-processing deduplication first writes data to storage and then performs deduplication as a batch job, which means it requires extra capacity to hold the whole data set at first. It usually has less impact on controller performance and achieves higher deduplication rates than inline deduplication.
Inline deduplication dedupes data before writing it to storage. It has a higher impact on the controller than post-processing deduplication, but it never writes duplicated data to storage. This reduces the total amount of writes to disk, which extends the life of SSDs. Another benefit is that duplicated data stays in memory as a cache, which improves read performance by increasing cache hit rates.
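The inline write path can be sketched in a few lines. This is an illustration of the general technique, not IzumoFS's actual implementation: each incoming chunk is hashed before it is written, and only previously unseen chunks reach the backing store.

```python
import hashlib

class InlineDedupStore:
    """Minimal sketch of an inline-deduplicating store: chunks are
    hashed *before* the write, so duplicates never hit the disk."""

    def __init__(self):
        self.chunks = {}   # hash -> chunk data (stands in for disk)
        self.refs = {}     # hash -> reference count

    def write(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.chunks:        # unseen data: write it
            self.chunks[digest] = data
        self.refs[digest] = self.refs.get(digest, 0) + 1
        return digest                        # caller keeps the handle

    def read(self, digest: str) -> bytes:
        return self.chunks[digest]

store = InlineDedupStore()
h1 = store.write(b"hello world")
h2 = store.write(b"hello world")   # duplicate: no second copy stored
```

Because the second write resolves to the same hash, the store holds one physical copy while tracking two references, which is exactly where the capacity and SSD-endurance savings come from.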
One of the biggest challenges for distributed storage is communication between nodes. By its nature, distributed storage constantly performs network requests between nodes. The details depend on the product's redundancy model, but data usually has to be transferred on both reads and writes. If the cluster structure changes, for example when a node is added or removed, rebalancing takes place, sometimes across multiple sites separated by long physical distances.
That said, the network often becomes the bottleneck of distributed storage. To overcome this, some products require a rich environment such as a minimum of 1Gbps Ethernet, InfiniBand, or separate networks for data access and node communication.
The inline deduplication in IzumoFS was implemented to solve these kinds of problems. Inline deduplication is usually added to storage to increase capacity efficiency or extend SSD life, but for distributed storage like IzumoFS it brings another great benefit: it reduces network load [1].
The timing at which IzumoFS executes deduplication and distribution depends on the write guarantee policy setting. The write guarantee policy is an internal setting of IzumoFS, and I'll skip the detailed explanation, but briefly, it switches IzumoFS between two modes: (1) performance-oriented and (2) consistency-oriented. It determines the consistency policy of the whole system, but it also changes how IzumoFS handles duplicated data.
Mode (1) distributes data after deduplication, while mode (2) first distributes data and then deduplicates it. For each write, IzumoFS calculates a hash of the data and avoids saving it if the same data already exists anywhere in the cluster. This procedure is performed regardless of the redundancy policy or the file access protocol. In mode (1), no duplicated data is transferred over the network, which cuts down network load.
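The difference between the two orderings can be made concrete with a toy model. This is my own sketch, not IzumoFS code: both functions replicate a stream of chunks to the cluster, but only the dedupe-first ordering keeps duplicates off the wire.

```python
import hashlib

def dedupe_then_distribute(chunks, cluster_index, send):
    """Mode (1) sketch: hash each chunk first and ship only
    chunks the cluster has never seen."""
    for c in chunks:
        h = hashlib.sha256(c).hexdigest()
        if h not in cluster_index:
            cluster_index.add(h)
            send(c)                 # only unique data hits the network

def distribute_then_dedupe(chunks, cluster_index, send):
    """Mode (2) sketch: ship everything, deduplicate on arrival."""
    for c in chunks:
        send(c)                     # every chunk crosses the network
        cluster_index.add(hashlib.sha256(c).hexdigest())

# Four 8-byte chunks, three of them identical.
chunks = [b"A" * 8, b"B" * 8, b"A" * 8, b"A" * 8]

sent1, sent2 = [], []
dedupe_then_distribute(chunks, set(), sent1.append)
distribute_then_dedupe(chunks, set(), sent2.append)
print("mode (1) bytes on the wire:", sum(len(c) for c in sent1))  # 16
print("mode (2) bytes on the wire:", sum(len(c) for c in sent2))  # 32
```

With three duplicate chunks out of four, mode (1) transfers half the bytes of mode (2); the more duplication in the workload, the larger the network saving.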
Another effect is that random writes are processed in memory and transferred to HDDs as sequential writes [2], so HDDs can perform better.
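The sequentializing effect can be sketched as a simple write coalescer. Again, this is an illustration of the general idea rather than IzumoFS internals: random-offset writes are absorbed in memory and flushed to the "disk" in one ordered pass.

```python
class WriteCoalescer:
    """Sketch of turning random writes into sequential disk IO:
    writes are absorbed in RAM, then flushed in offset order."""

    def __init__(self):
        self.pending = {}          # offset -> data, memory only
        self.disk_log = []         # the order the disk actually sees

    def write(self, offset: int, data: bytes):
        self.pending[offset] = data   # arrives in random order

    def flush(self):
        for off in sorted(self.pending):   # one sequential sweep
            self.disk_log.append((off, self.pending[off]))
        self.pending.clear()

w = WriteCoalescer()
for off in (700, 100, 400):      # random write pattern from the client
    w.write(off, b"x")
w.flush()                        # disk sees 100, 400, 700 in order
```

An HDD services the sorted flush with far less head movement than it would the original random pattern, which is where the performance gain comes from.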
The first concern I hear from customers when talking about inline deduplication is CPU overhead. Most storage products that do inline dedupe perform deduplication on chunks of a few KB to tens of KB, so they carry a relatively high CPU load. IzumoFS, however, performs deduplication on 1MB chunks, which is comparatively large. This is because IzumoFS is designed to run on low-spec machines, such as a Xeon E3 series CPU with 16GB of memory, and to perform deduplication with only a few percent of CPU overhead.
In principle, the deduplication chunk size and capacity efficiency are in a trade-off relationship. It's true that IzumoFS has lower capacity efficiency than storage that deduplicates in chunks of tens of KB. But the core concept behind inline deduplication in IzumoFS is to avoid transferring duplicated data over the network and to improve performance by raising cache hit rates and sequentializing the IO pattern. We believe deduplicating data in 1MB chunks has its own benefits, especially for distributed storage.
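The chunk-size trade-off is easy to demonstrate on toy data. In this sketch (scaled down from real chunk sizes for readability), a single modified byte "poisons" the whole chunk that contains it, so coarse chunking detects fewer duplicates than fine chunking:

```python
import hashlib

def dedup_ratio(data: bytes, chunk_size: int) -> float:
    """Fraction of fixed-size chunks that are duplicates."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha256(c).hexdigest() for c in chunks}
    return 1 - len(unique) / len(chunks)

# Toy data: a repeating 4-byte pattern with one byte changed in the middle.
data = bytearray(b"ABCD" * 64)      # 256 bytes
data[100] = 0

small = dedup_ratio(bytes(data), 4)    # fine-grained: isolates the change
large = dedup_ratio(bytes(data), 64)   # coarse: the change taints 64 bytes
print(f"4-byte chunks dedupe {small:.0%}, 64-byte chunks {large:.0%}")
```

Here 4-byte chunking deduplicates about 97% of the data while 64-byte chunking manages only 50%, which mirrors the efficiency gap between tens-of-KB and 1MB chunking; the design question is whether that extra capacity saving is worth the higher CPU and metadata cost.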
When it comes to capacity efficiency, for some workloads such as VM images we have seen over 70% data reduction (though this very much depends on the environment and workload). You can also expect good reduction in VDI environments. Another common use case is IzumoFS as a file server: if a user simply copies a file into another directory under a different name, the deduplication rate will be very high.
So far I have introduced the good parts of inline deduplication, but of course it doesn't resolve every network problem. For example, to tackle issues with data rebalancing we need different solutions, such as node replacement. If other applications share the same network as the storage, it may be better to separate those networks first. Deduplication also doesn't perform well on media data such as images and videos. If you have any environment-specific questions, please contact IzumoBASE. We'll be more than happy to help you.
By understanding inline deduplication, you can maximize the capability of your storage. Ordinary SDS that transfers all data over the network has long suffered from network bottlenecks. Although inline deduplication is mostly used to reduce data size or extend the life of SSDs, distributed storage gains a whole new capability from it. I think this technology has great potential to expand the use cases of distributed storage.