I’ve gotten lots of questions on the “if that set of files were the only files on the Data Domain system” statement from my last post, Not The Droids You Were Looking For, where I went over Physical Capacity Measurement. (As a side note, newer documentation refers to it as Physical Capacity Reporting, or PCR. For consistency with older documentation, I’ll refer to it as PCM.) The statement is a caveat about how capacity is counted when you create PCM reports on individual MTrees. So, in response, I’ve created a new diagram to help explain it a little further.
As you recall, Data Domain inline deduplicates data segments as data is being written. If you have a deduplicated segment that is referenced by two MTrees, it will show up on the report for each of those MTrees. From a capacity standpoint, if you total (sum) the capacity reported by PCM for each MTree, you’ll count that same deduplicated segment twice.
In this example, data is backed up to the HR MTree. As it is being backed up, Data Domain inline deduplicates the data using variable-length segments and writes only the unique data to physical disk. The SISL and deduplication process is more complicated than this, but for the sake of this explanation, I’ve simplified it. Notice that the “ERI” data is deduplicated as data is being written to the HR MTree. The same goes for the “KA” data.
Then, at a later date or time, a separate backup kicks off and is written to the Billing MTree. Since there is duplicate data, Data Domain dedupes that data and does not write it to physical disk. In this case, “KA”, “KMC”, and “ERI” are all duplicates and are not written to disk. Notice also that “KA” occurs twice in the Billing MTree, but neither instance is written to disk. This is muy importante when explaining the PCM caveat.
When you look at how much data is sitting on the physical disk ACROSS BOTH MTREES, you’ll see that it comes out to be 48K. This is “post-comp”, or in other words, after deduplication and compression have taken place.
If you were to run a PCM report on the HR MTree, it would show that it is taking up 28K on disk. Again, this is post-comp. Since “ERI” and “KA” are each written to disk only once, their repeat occurrences are not counted again in the post-comp total.
But what about the Billing MTree? Here’s where it gets a little confusing. Let’s go back to the statement from earlier. When you run a report on an MTree, it measures the physical capacity that would be consumed on a Data Domain system by a set of files, if that set of files were the only files on the Data Domain system. Even though “KA” was already on the physical disk when backups were written to the Billing MTree, you have to keep in mind that the PCM report will still count at least one instance of “KA” as if it was originally written to that MTree. Therefore, the total in the report for the Billing MTree will be 36K.
If you add up the PCM reports for the HR MTree (28K) and the Billing MTree (36K), you get 64K, which is more than the 48K that is actually sitting on physical disk.
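To make the arithmetic concrete, here is a rough Python sketch of the counting rule. The segment names “ERI”, “KA”, and “KMC” come from the example above, but the other segment names (HR1, HR2, BIL1, BIL2), every segment size, and the function names are assumptions I made up so the totals land on the numbers in this post. This is not how Data Domain actually implements PCM, just a way to see why the reports overlap.

```python
# Hypothetical post-comp segment sizes in KB, chosen only so the totals
# match the post (28K, 36K, 48K). They are NOT taken from the diagram.
SEGMENT_KB = {
    "ERI": 8, "KA": 4, "KMC": 4,   # segments referenced by both MTrees
    "HR1": 6, "HR2": 6,            # made-up segments unique to HR
    "BIL1": 10, "BIL2": 10,        # made-up segments unique to Billing
}

# Segment references per MTree in write order; repeats are duplicates
# that deduplication never writes to disk a second time.
MTREES = {
    "HR": ["ERI", "HR1", "KA", "ERI", "KMC", "HR2", "KA"],
    "Billing": ["KA", "BIL1", "KMC", "ERI", "KA", "BIL2"],
}


def pcm_report_kb(mtree):
    """Size as if this MTree's files were the only files on the system."""
    return sum(SEGMENT_KB[seg] for seg in set(MTREES[mtree]))


def physical_kb():
    """Each unique segment is stored once, no matter who references it."""
    unique = set().union(*MTREES.values())
    return sum(SEGMENT_KB[seg] for seg in unique)


hr = pcm_report_kb("HR")
billing = pcm_report_kb("Billing")
print(f"HR report:      {hr}K")
print(f"Billing report: {billing}K")
print(f"Sum of reports: {hr + billing}K")
print(f"On disk:        {physical_kb()}K")
```

The 16K gap between the sum of the reports and what is on disk is exactly the shared “ERI”, “KA”, and “KMC” segments being counted once per MTree.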
So I’ll also restate something from the previous post: This is an important distinction, especially when it comes to multi-tenancy. Shared data has to be counted on each report (regardless of whether you’re running it for a tenant, an MTree, or a sub-MTree). Otherwise, in a chargeback model, the guy who initially wrote the data would pay and the others would not. In this case, the business unit that is being charged for the HR MTree would get charged for its data that is sitting on disk (such as the “KA” data). Additionally, the business unit writing to the Billing MTree would get charged for this data as well. This may seem like “double charging”, but how would you fairly charge HR for it and not Billing? And should HR delete that data from its backups, it would still reside on disk because it is referenced by the Billing MTree, so this has to be taken into account.
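The deletion point can be sketched the same way. As before, the segment sizes, the HR1/HR2/BIL1/BIL2 names, and the stored_kb helper are made up for illustration, and the sketch ignores when the system actually reclaims the space. It only shows which segments still have a reference.

```python
# Hypothetical post-comp segment sizes in KB (assumed, not from the diagram).
SEGMENT_KB = {
    "ERI": 8, "KA": 4, "KMC": 4,
    "HR1": 6, "HR2": 6,
    "BIL1": 10, "BIL2": 10,
}

# Unique segments referenced by each MTree (made-up layout).
HR = {"ERI", "KA", "KMC", "HR1", "HR2"}
BILLING = {"ERI", "KA", "KMC", "BIL1", "BIL2"}


def stored_kb(*mtrees):
    """KB that must stay on disk while any listed MTree references it."""
    live = set().union(*mtrees)
    return sum(SEGMENT_KB[seg] for seg in live)


before = stored_kb(HR, BILLING)   # both MTrees present
after = stored_kb(BILLING)        # HR's backups deleted
hr_report = stored_kb(HR)         # what a report on HR alone would show

print(f"HR report:          {hr_report}K")
print(f"Freed by deleting:  {before - after}K")
print(f"Still on disk:      {after}K")
```

With the totals from this post, HR’s report says 28K, but deleting HR’s backups would only free 12K, because the other 16K is still referenced by Billing.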
See how this can make a difference? Happy de-duping.