After a few weeks and several reasons (professional and non-professional) I finally restarted writing about my current vSphere replication version 6.0 series. This part 2 focus on some network design options and how they might impact SLAs defined for the Recovery Point Objective (RPO).

Since I summarized the architecture and components in part 1 I am now going to analyze the effects on the performance based on the network design decisions.

Option 1: “Keep as much traffic as possible within the ESXi”


Result via ESXTOP:

ESXTOP - option 1

-> With the network configuration to minimize the routing effort I was able to nearly utilize to complete vmnic adapter (ca. 900 Mbit / s)

Option 2: “Having replication traffic routed between vSphere Replication Appliance and the VMkernel port”


Result via ESXTOP:

ESXTOP Option 2

-> As expected the throughput dropped nearly by 50% to around 440 Mbit / s.

I know that those 2 results are depending on the specific characteristic of my homelab environment. The reason I have written that down was to create an awareness that the network decision has an impact on the replication performance and therefore maybe on the fact if you can meet an SLA or NOT.

Let’s make a short calculation within a small scenario.

RPO – Recovery Point Objective: How much data can get lost during a failure. This value is configured during the setup of a replication job and defines within which time-interval the concrete replication is started

Amount machines 15
VM Size 100 GB = 102400 MB
Max average daily disk-change rate 5%
Max Replication transfer rate option 1 901 Mbit / s = 112,625 MB / s
Max Replication transfer rate option 1 440 Mbit / s = 55 MB / s

The initial replication can be calculated with the following formula:

Screen Shot 2015-06-30 at 22.23.10

and will take the following amount of time in our scenario:

Option 1:

Screen Shot 2015-06-30 at 22.23.37

Option 2:

Screen Shot 2015-06-30 at 22.24.01

To meet a SLA we are in most cases more interested about how long the ongoing replication will take place.

Screen Shot 2015-06-30 at 22.24.36

Option 1:

Screen Shot 2015-06-30 at 22.25.04

Option 2:

Screen Shot 2015-06-30 at 22.25.28

So if you have an RPO defined with 15 minutes there is a risk not to meet the SLA within option 2.

Maybe I repeat myself, but that this is just an example of a calculation (and depending on the use case the limiting factor will be the link between the protected and replicated site). Nevertheless you need to get aware of the following relevant metrics when you design replication:

  • replication-throughput
  • change-rate
  • number and size of your VMs.

In production we don’t want to receive an RPO violation alarm (technically or by the service manager ;-). If you can’t meet the requirements in a theoretical calculation, you will not be able to meet them during daily operations.

Which tool can we use to get the above metrics? Replication-throughput via ESXTOP (network-view: n), number and size of your VMs via PowerCLI (If haven’t done stuff with PowerCLI so far, this is a great starting task for it ;-).

For gathering data about the data change-rate within a VM I refer to a PowerCLI script Scott Herold (his name was in the comments) has created a few years ago that used the change-block-tracking mechanism. I found the script via google and you can download it here (Download: CBT_Tracker – Howto). No need to say that you should understand it (and it’s influence on your system – it uses CBT and snapshots – see the comments within the script) and test the script first before you use it for your analysis.

Compression – The X-Files continues

As I have already said VMware has included a new compression mechanism in 6.0 to speed up the initial copy job. During my first tests (Setup 1 with compression enabled) I had a higher CPU utilization (that’s expected on my vSphere Replication Appliance), but also a lower Throughput of the replication data. I am totally unaware what went wrong here. I will try to figure out more about this effect and keep you informed ;-). If you have any ideas/hints what went wrong in my setup. Please comment or contact via Twitter (@lenzker).