Goodbye vXpertise, hello vLenzker

This will be the last post here on vXpertise. Over the last years Mathias and I created many (sometimes) interesting posts here on vXpertise. I still can't believe we had around 50,000 impressions within the last year, but it's always great that someone out there (except the bots) is reading the stuff we produced.

After some bad hacks this year, in which we lost some of our articles, and some personal developments in our careers, we decided not to publish on vXpertise anymore.

Since I am still trying to improve my writing I will continue to create blog posts on virtualization topics on

http://vLenzker.net

Screen Shot 2015-11-22 at 23.05.38

I hope you will find some time to check out my content there. If so, follow me on Twitter or write me a mail ;-)

vSphere Replication 6.0 part 3: enhanced linked mode, backup and recovery scenarios

In the 3rd part of my series I am going to talk about the usage of vCenter enhanced linked mode and vSphere Replication 6.0 and how it can be used to protect the vSphere replication infrastructure itself.

In the newest version vSphere Replication makes use of the Lookup Service provided by SSO on the new Platform Services Controller (PSC). Having multiple vCenter instances share the same PSC, the so-called vCenter enhanced linked mode, we are not just able to manage all vCenters within a single vSphere Web Client. We can also use vSphere Replication to replicate, and therefore protect, VMs from one site to another and simply migrate a VM back after a restore of the protected site, all within an integrated view.

The following diagram shows a logical view of a recommended vCenter enhanced linked mode setup.

vsphere_replication_enhanced_linkedMode

This architecture has a lot of benefits. You have a vCenter in both sites, which is required when you are forced to recover your VMs (in a supported way). As soon as our vCenters are in enhanced linked mode, we are able to select any joined vCenter as a target for our vSphere Replication protection.

vSphere Replication linked mode target site

I see very often that the backup strategy of some organizations does not take into consideration that you usually MUST have a vCenter to recover a VM with your backup solution (if there is no 'emergency-direct-to-ESXi-recovery' feature included). Sure, there are ways to register the replicated VM back on the recovery site, but hey... (officially) we need to make sure that our recovery procedures are supported by the vendor.

In the current situation there is one thing I am uncomfortable with. The currently recommended way from VMware is to create a dedicated PSC cluster with a network load balancer in front of it. Since only NSX, F5 and NetScaler are supported, this adds an additional cost for licensing, operating and implementing the solution. To be honest, I don't expect to see such a setup very often in non-enterprise environments (on this level people are waiting for vVol replication support ;-)).

The only 'easier' suitable option would be to put a solution like the following in place:

vcenter_enhanced_linked_mode

Referring to VMware's blog post on the new PSC architecture possibilities, the only recommended option is the one mentioned in the beginning. I am currently evaluating and looking for discussions about the pros/cons of the mentioned configuration. I will write about the findings in a different post.

Protect and Recover vSphere Replication Appliances and Server (Demo)

It's worth remembering to protect the vSphere Replication appliances as well, so that in case of an outage you are able to bring back the replication infrastructure pretty painlessly. I am going to show you how to recover from a vSphere Replication appliance data loss.

In my lab environment I have two sites and I am going to protect the vSphere Replication appliance from LE02 (managed by LE02VCE03) to LE01 (managed by vCSA01). The PSC of each vCenter has joined the same SSO domain.

On my protected site I have 6 machines protected.

In the first scenario I have lost my vSphere Replication appliance data on the protected site, so I recover it (vSRA) with the help of vSphere Replication

vSphere_replication_1

and once the original site has been restored, I fail back to it via cross vCenter vMotion.

vSphere_Replication_2 vSphere_Replication_4

One thing you need to take care of is that the vSphere Replication appliance and server are registered against a vCenter. If you restore this machine in the way I described above, or with any other backup solution that restores the VM, you need to make sure to re-register the VM with the vCenter; otherwise you will see the following error within the vSphere Replication menu.

vSphere_replication_6

 

So what to do? Register the recovered VM as a vSphere replication server

Screen Shot 2015-07-14 at 16.16.10

and verify that all of your vSphere replication jobs are still in place / running.

Screen Shot 2015-07-11 at 11.15.50

Voila… we recovered the vSphere Replication Appliance and can go on with our next test.

Recover protected virtual machines and fail back with cross vCenter vMotion (Demo)

My protected site has failed and the data has been lost. Lucky me, I was able to recover all protected VMs on my recovery site. Depending on the network characteristics you might be forced to change the IPs of your VMs (PowerCLI can be your friend, see the sketch below ;-) ).
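A hypothetical PowerCLI sketch for re-IPing recovered Windows guests via VMware Tools could look like this. The VM name pattern, guest credentials, adapter name and IP range below are placeholders, not values from my lab, and it assumes an existing Connect-VIServer session to the recovery-site vCenter.

# Hypothetical sketch: assign new static IPs to recovered Windows VMs via VMware Tools.
# VM name pattern, guest credentials, adapter name and IP range are placeholders.
# Assumes an existing Connect-VIServer session to the recovery-site vCenter.
$lastOctet = 25
foreach ($vm in Get-VM -Name 'LE02-APP*') {
    $script = "netsh interface ip set address name=""Ethernet0"" static 192.168.10.$lastOctet 255.255.255.0 192.168.10.1"
    Invoke-VMScript -VM $vm -ScriptText $script -ScriptType Bat -GuestUser 'Administrator' -GuestPassword 'VMware1!'
    $lastOctet++
}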

Screen Shot 2015-07-14 at 17.11.58

Screen Shot 2015-07-13 at 21.32.39

After the rebuild of my primary site I was able to fail back/migrate all VMs with cross vCenter vMotion to the original site.

Screen Shot 2015-07-13 at 22.53.59

 

Finalize the steps and voila. You have successfully failed back the VMs.

Make sure to configure a re-protection of the virtual machines.

Final words

The thing I am still missing is a smooth way of getting a simple vCenter linked mode setup. Once I lost my protected site, the Web Client became really slow and sluggish. Even after the site recovery I needed a reboot of my primary vCenter to get it fully functional again. At this point I am still not sure what the best way is to establish vCenter enhanced linked mode in a 'stretched' environment. Any input / discussions / opinions are very much appreciated.

 

 

vSphere Replication 6.0 – Part 2: vSphere Replication performance and SLAs

After a few weeks and for several reasons (professional and non-professional) I have finally restarted writing my vSphere Replication 6.0 series. Part 2 focuses on some network design options and how they might impact SLAs defined for the Recovery Point Objective (RPO).

Since I summarized the architecture and components in part 1, I am now going to analyze the effect of the network design decisions on performance.

Option 1: "Keep as much traffic as possible within the ESXi"

vsphere_replication_design2

Result via ESXTOP:

ESXTOP - option 1

-> With the network configuration that minimizes the routing effort I was able to nearly saturate the complete vmnic adapter (ca. 900 Mbit/s).

Option 2: "Having replication traffic routed between the vSphere Replication appliance and the VMkernel port"

vsphere_replication_design1

Result via ESXTOP:

ESXTOP Option 2

-> As expected, the throughput dropped by nearly 50% to around 440 Mbit/s.

I know that those two results depend on the specific characteristics of my homelab environment. The reason I have written them down is to create an awareness that the network design decision has an impact on the replication performance and therefore possibly on whether you can meet an SLA or NOT.

Let’s make a short calculation within a small scenario.

RPO – Recovery Point Objective: how much data can get lost during a failure. This value is configured during the setup of a replication job and defines the time interval within which the next replication is started.

  • Number of machines: 15
  • VM size: 100 GB = 102,400 MB
  • Max. average daily disk change rate: 5%
  • Max. replication transfer rate, option 1: 901 Mbit/s = 112.625 MB/s
  • Max. replication transfer rate, option 2: 440 Mbit/s = 55 MB/s

The initial replication can be calculated with the following formula:

initial replication time = (number of VMs × VM size) / replication transfer rate

and will take the following amount of time in our scenario:

Option 1:

(15 × 102,400 MB) / 112.625 MB/s ≈ 13,640 s ≈ 3.8 hours

Option 2:

(15 × 102,400 MB) / 55 MB/s ≈ 27,930 s ≈ 7.8 hours

To meet an SLA we are in most cases more interested in how long the ongoing replication will take.

ongoing replication time = (number of VMs × VM size × daily change rate) / replication transfer rate

Option 1:

(15 × 102,400 MB × 0.05) / 112.625 MB/s ≈ 682 s ≈ 11.4 minutes

Option 2:

(15 × 102,400 MB × 0.05) / 55 MB/s ≈ 1,396 s ≈ 23.3 minutes

So if you have an RPO defined as 15 minutes, there is a risk of not meeting the SLA with option 2.

Maybe I repeat myself, but this is just an example calculation (and depending on the use case the limiting factor will be the link between the protected and the recovery site). Nevertheless you need to be aware of the following relevant metrics when you design replication:

  • replication-throughput
  • change-rate
  • number and size of your VMs.

In production we don’t want to receive an RPO violation alarm (technically or by the service manager ;-). If you can’t meet the requirements in a theoretical calculation, you will not be able to meet them during daily operations.

Which tools can we use to get the above metrics? Replication throughput via ESXTOP (network view: n); number and size of your VMs via PowerCLI (if you haven't done anything with PowerCLI so far, this is a great starting task for it ;-).
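A minimal PowerCLI sketch for the VM-related metrics could look like the following; it assumes an existing Connect-VIServer session, and the output path is just an example:

# Sketch: number and size of all VMs, exported to CSV for the sizing exercise above.
Get-VM |
    Select-Object Name, NumCpu, MemoryGB,
        @{Name='ProvisionedGB'; Expression={[math]::Round($_.ProvisionedSpaceGB, 1)}},
        @{Name='UsedGB'; Expression={[math]::Round($_.UsedSpaceGB, 1)}} |
    Sort-Object UsedGB -Descending |
    Export-Csv -Path C:\temp\vm-sizing.csv -NoTypeInformation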

For gathering data about the change rate within a VM I refer to a PowerCLI script that Scott Herold (his name was in the comments) created a few years ago, which uses the changed block tracking mechanism. I found the script via Google and you can download it here (Download: CBT_Tracker – Howto). Needless to say, you should understand it (and its influence on your system – it uses CBT and snapshots – see the comments within the script) and test the script first before you use it for your analysis.

Compression – The X-Files continues

As I have already said, VMware has included a new compression mechanism in 6.0 to speed up the initial copy job. During my first tests (setup 1 with compression enabled) I saw a higher CPU utilization (that's expected on my vSphere Replication appliance), but also a lower throughput of the replication data. I am totally unaware of what went wrong here. I will try to figure out more about this effect and keep you informed ;-). If you have any ideas/hints about what went wrong in my setup, please comment or contact me via Twitter (@lenzker).

Enhance your #homelab by re-enabling transparent page sharing

“Sharing is caring!”

I decided to quickly write down a few last words about a widely discussed topic before my 4-week journey to Brazil begins.

Transparent page sharing (TPS)

The concept of transparent page sharing has been widely explained and its impact discussed over the years (whitepaper). Short version: multiple identical virtual memory pages point to the same single page within the host memory.

The general behaviour has changed multiple times with enhancements in AMD's & Intel's newer CPU generations, where large pages are used intensively (2 MB instead of 4 KB areas -> increasing the chance of a TLB hit) and the benefits of TPS couldn't be used until the ESXi host gets under memory pressure and changes its memory state to break up with his girlfri... I mean, to break the large pages into small ones.

Last year some security concerns came up around this feature and VMware proactively deactivated TPS in all newer ESXi versions and updates (KB).

I don't want to talk about the impacts and design decisions on productive systems, but on homelab environments instead. It is nearly a physical constant that a homelab always lacks memory.

By deactivating large pages & the new security mechanisms you can save a nice and predictable amount of consumed/host-backed memory.

And especially in homelab environments the risk of lower performance (caused by the higher memory access time due to the higher probability of TLB misses) and the security concerns might be acceptable.

What to do?

Change the following advanced system settings on each ESXi host:

Mem.AllocGuestLargePage=0

advancedsetting1

Mem.ShareForceSalting = 0

advancedsetting2

and wait a couple of minutes until TPS kicks in.
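If you don't want to click through the Web Client for every host, a short PowerCLI sketch like this (assuming a connected vCenter session) applies both settings to all hosts:

# Sketch: set both TPS-related advanced settings on every ESXi host.
foreach ($esx in Get-VMHost) {
    Get-AdvancedSetting -Entity $esx -Name 'Mem.AllocGuestLargePage' | Set-AdvancedSetting -Value 0 -Confirm:$false
    Get-AdvancedSetting -Entity $esx -Name 'Mem.ShareForceSalting' | Set-AdvancedSetting -Value 0 -Confirm:$false
}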

Effect on my homelab

60 minutes after the settings were applied, the amount of consumed RAM on my cluster (96 GB RAM setup) decreased from 55 GB to 44.8 GB, which means around 20% of memory consumption has been saved in my VM constellation (multiple identical Windows 2012 R2 and nested ESXi VMs, which have a high 'shared' value).

vCenter_TPS_consumed_memory

So if you need a very quick approach to work around the memory pressure in your homelab and you can live with the potential performance loss -> re-activate transparent page sharing as a first step to optimize your environment. Sure, you can also skip the deactivation of large pages and hope that during a change in the memory state the large page breakup process is quick enough so that TPS works again. But I preferred a permanent and auditable approach of monitoring the amount of shared memory in my lab.
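To keep that auditable view without clicking through the UI, a small PowerCLI sketch like the following can report the currently shared memory per host (assuming a connected session; mem.shared.average is the realtime counter for shared guest memory, reported in KB):

# Sketch: show how much memory each host is currently sharing via TPS.
foreach ($esx in Get-VMHost) {
    $sample = Get-Stat -Entity $esx -Stat 'mem.shared.average' -Realtime -MaxSamples 1 | Select-Object -First 1
    '{0}: {1:N1} GB shared' -f $esx.Name, ($sample.Value / 1MB)
}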

The 2nd step to optimize the memory consumption -> size your VMs right... but this is nothing I will tell you about, since my plane is about to take off... my substitute vRealize Operations will do the job ;-)

#PowerCLI: Distributed switch portgroup usage. Good vs. bad vs. Max Power approach

From now on there are three ways of doing things: the right way, the wrong way, and the Max Power way.

Isn’t that the wrong way?

Yes! But faster!

This quote from Homer Simpson came directly to my mind when I was doing some PowerCLI scripting during the week.

maxpower

I started with the wrong / Max Power way and suddenly came to a much smarter solution – the right way.

The task was to gather the usage of the distributed switch portgroups within a world-wide environment with around 30 vCenters. (Final scripts are at the bottom of this post.)

Once again I realized there are many roads to Rome and even with PowerCLI you can either go there by crawling or using a plane.

My first approach was to get each portgroup and look through each port to check whether it has a connection to the virtual network adapter of a VM (each VM only has one network adapter).

$ports = Get-VDSwitch | Get-VDPort -VDPortgroup $pg
$portsconnected = $ports | Where {$_.ConnectedEntity.Name -eq 'Network adapter 1'}

That approach was incredibly slow (> 12 hours) since it took a while to get all port objects of the distributed switch (more than 5000 per vDS).

Thanks to Brian Graf's great blog article we know how to access the extension data of vSphere objects in a much more elegant way.

$networks = Get-View -viewtype Network
Foreach ($network in $networks){
    $pgname = $network.Name
    $connectedports = ($network.VM).count 
}

Doing it that way took 15 minutes instead of 12++ hours.

It really makes a huge difference whether you code something right or wrong. That counts for software, SQL queries and also for all kinds of scripts we use and build in our daily IT infrastructure world.

The final script gives you a CSV output file with the values

Datacenter, PortgroupName, VLANID, NumberOfConnectedPorts

Make sure to use PowerShell 3.0 or newer so you can use the -Append option of the Export-Csv cmdlet.

Enjoy.

Good one

# Collect one result object per distributed portgroup
$results = @()

$dcName = (Get-Datacenter).Name
# One Get-View call returns all network objects (portgroups) including their connected VMs
$networks = Get-View -ViewType Network

Foreach ($network in $networks){
    $pgname = $network.Name
    $pg = Get-VDPortgroup -Name $pgname
    $vlanid = $pg.VlanConfiguration.VlanId
    # Counting the VM references is much cheaper than enumerating every single port
    $connectedports = ($network.VM).count

    $details = @{
        PortgroupName = $pgname
        VLANID = $vlanid
        NumberOfConnectedPorts = $connectedports
        Datacenter = $dcName
    }

    $results += New-Object PSObject -Property $details
}

$results | Export-Csv -Append -Path c:\temp\newvDSnetworkConnected.csv -NoTypeInformation

 Bad one

$results = @()

$dcName = (Get-Datacenter).Name
$pgs = Get-VDSwitch | Get-VDPortgroup | Where {$_.IsUplink -ne 'True'}

foreach ($pg in $pgs){

    $ports = Get-VDSwitch | Get-VDPort -VDPortgroup $pg
    $portsconnected = $ports | Where {$_.ConnectedEntity.Name -eq 'Network adapter 1'}

    $pgname = $pg.name
    $vlanId = $pg.VlanConfiguration.VlanId

    $connectedports = $portsconnected.count
    $details = @{
        PortgroupName = $pgname
        VLANID = $vlanId
        NumberOfConnectedPorts = $connectedports
        Datacenter = $dcName
    }
    $results += New-Object PSObject -Property $details 
} 

$results | export-csv -Append -Path c:\temp\vDSnetworkConnected.csv -NoTypeInformation

vLenzker #Homelab: quiet, small, scalable and powerful(!?)

A few months ago a simple thought came to my mind and it didn't leave for several months.

'With new and cool software like vSAN, PernixData FVP, vROPS, vRAC, vSphere 6.0, ... you need a new #homelab to test this stuff.'

Yeah… I somehow felt inception-ized ;-).

At the end of last year I had a phone call with Manfred Hofer from vbrain.info about his great #homelab posts and design decisions on his blog. Even though I did not choose one of his proposed designs, I really want to thank you, Fred, for your efforts and the great summary.

Since I was asked by multiple people to document my new hardware, I quickly summarized it here:

I had the following requirements for my #homelab:

  • min. of 3 nodes (for getting vSAN up and running)
  • min. of 96GB RAM
  • low-power
  • low-noise (currently it’s standing close to my office-desk)
  • small
  • min. 2 NICS per node

I didn't really care about ECC support, IPMI, etc. Nothing productive will run there. I just need suitable performance and capacity (ca. 10-16 cores / 2,000-3,000 4K 70/30 random IOPS / 2-4 TB disk) to do some quick'n'dirty testing / customer environment simulations. Intel NUCs would have been a perfect choice, but the lack of 32 GB support disqualified them ;-/

In the end I decided to go for the following solution.

Computing

  1. Shuttle SH87R6
  2. Intel Core i5-4440S
  3. 4 × 8 GB DDR3 memory
  4. Intel Pro PT 1000 Dual
  5. 1x Crucial CT256MX
  6. 1x Crucial CT512MX

Network

  1. Cisco SG-300 – 20 ports
  2. Huawei WS311 Wifi-bridge

Storage

  1. Synology DS414 Slim
  2. 1x Crucial CT512MX
  3. Western Digital RED 1 TB

Currently I have vSphere 6.0 running with a vSAN 6.0 datastore. I have also decided to set up a dedicated NFS share on the Synology for maintenance/testing reasons, so I can easily demote/recreate the vSAN datastore. Having nearly everything on SSD gives me a performance that is suitable for me and lets me work efficiently with new products (even if the local SATA controllers are limited in their capabilities, but hey... it's non-productive ;-).

After optimizing some of my cabling and replacing the fan of the Shuttle barebone I really like the solution on my desk. It's powerful, small and scalable enough for the next things I am planning to do. Even if my hardware requirements increase, I can scale up the solution pretty quickly and easily.

homelab_lenzker

So far I have not been able to get the embedded Realtek NIC up and running with vSphere 6.0. But to be honest, I haven't spent much time on it ;-). Once I have an update here, I will let you know.

Shall I upgrade vHW? Enhanced vNUMA support with Hardware Version 11 (vHW11) in vSphere 6.

As with every release VMware increased the version of the virtual hardware of its virtual machines.

ESX/ESXi 4.x – vHW 7
ESXi 5.0 – vHW 8
ESXi 5.1 – vHW 9
ESXi 5.5 – vHW 10
ESXi 6.0 – vHW 11

Each ESXi / vCenter version has only a limited set of compatible vHW versions (e.g. vHW11 MUST run on ESXi 6.0+ and therefore be managed by a vCenter 6.0+; check the VMware interoperability matrix for that) and brings multiple virtual machine related feature enhancements (e.g. new OS support, new device support, more vCPUs, more vRAM, more more more).

vHW11 - upgrade

Andreas Lesslhummer wrote a short summary of the what's-new features within vHW 11, and one feature directly caught my eye:

‘vNUMA aware hot-add RAM’

In my job people often ask me.

'Do I need to upgrade my virtual machine hardware version?'

and I respond with a typical consultant/trainer answer

‘It depends! ;-)’

You need to answer the following questions:

Q1. What is the current and the oldest version of vSphere where your VMs might run in a worst-case (failover, recovery, etc.)?

A1. Make sure you are able to run the VMs in all environments (Check the compatibility upwards (newest version) and downwards (oldest version)). You don’t want to start to deal with a vCenter converter to downgrade vHW again.

Q2. Do you need the additional features offered by the specific vHW?

A2. For most use cases I have dealt with so far, it was not necessary to upgrade the hardware version for the new features, unless you have pretty BIG MONSTER VMs or the customer had VMs with a vHW < 7. With vHW11 a new feature might come in handy if you are dealing with ... I don't know what to call them ... 'slightly smaller business-critical MONSTER VMs (> 8 vCPUs)'.

vSphere 6.0 and vHW11 resolve one out of two constraints regarding (v)NUMA that not everyone is aware of. vNUMA exposes the non-uniform memory access characteristics of the server to the guest operating system inside a virtual machine. The guest operating system can then schedule its processes onto the vCPUs more efficiently. By default this is only relevant if you have > 8 vCPUs and multiple sockets defined in your virtual machine hardware (Mathias Ewald wrote an excellent article about it a while ago).

NEW: If you enable memory hot-add for a VM (to increase the memory size during runtime, ideal for performance baseline definitions OR quick responses to increasing demand), the additional memory will be distributed equally over all existing NUMA nodes of the VM if you are on vHW11.

Unfortunately the other constraint still remains in vSphere 6.0. If you enable CPU hot-add in your VM, the vNUMA characteristics will be hidden from the guest OS (KB 2040375).

Make sure you are aware of the hot-plug settings you have configured in your environment for your business-critical VMs, since they might have a performance impact (sample here).
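A quick way to get that overview is a small PowerCLI sketch like the one below (assuming a connected vCenter session); it lists the virtual hardware version and both hot-add flags per VM:

# Sketch: inventory virtual hardware version and hot-add settings of all VMs.
Get-VM |
    Select-Object Name, NumCpu, MemoryGB,
        @{Name='vHW'; Expression={$_.ExtensionData.Config.Version}},
        @{Name='CpuHotAdd'; Expression={$_.ExtensionData.Config.CpuHotAddEnabled}},
        @{Name='MemHotAdd'; Expression={$_.ExtensionData.Config.MemoryHotAddEnabled}} |
    Sort-Object Name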

If you want to have memory hot-add available including vNUMA support and your complete environment is running on vSphere 6.0: upgrade to vHW11, enable memory hot-add and disable CPU hot-add.

vSphere Replication 6.0 Part 1: Architecture and features at a glance. vSphere Replication standalone

vSphere Replication is a really cool application that helps us replicate our virtual machines on a VM level without the need for dedicated, replicating storage. Besides the conventional replication methods it can also be used to replicate to a public cloud provider (call it DaaS or whatever ;-) ) like VMware vCloud Air (I am still lacking deep technical knowledge of VMware's hybrid cloud approach; it will be a separate blog post once I know more ;-) ).

In the following I want to give an architectural overview of the new version 6.0 of vSphere Replication. I realized during some of my Site Recovery Manager classes that people might get confused by some of the terminologies and maximums mentioned in the KBs, so I wanted to create something that clarifies all of those things.

This article is the first part of the vSphere Replication 6.0 series (not sure if series is the right word if only 3 episodes are planned ;-) )

General features and what’s new in vSphere Replication 6.0 (NEW)

  • Replication on a virtual machine level
  • Replication functionality embedded in the VMkernel as the vSphere Replication agent
  • RPO (Recovery Point Objective = data that can be lost): 15 min – 24 hours
  • Quiescing of the guest OS to ensure crash-consistency for Windows (via VSS) and Linux (NEW)
  • Replication independent of the underlying storage technology (vSAN, NAS, SAN, local)
  • Support for enhanced vSphere functionalities (Storage DRS, svMotion, vSAN)
  • Initial replication can be optimized by manually transferring the virtual machine to the recovery location (replication seeds)
  • Transfer of the changed blocks (not CBT) and the initial synchronization can be compressed to minimize the required network bandwidth (NEW)
  • Network can be configured in a more granular way and better isolated (NEW) with VMkernel functions for NFC and vSphere Replication (vSphere 6.0 required)
  • Up to 2000 replications (NEW)

Components required for vSphere replication

For vSphere Replication we need, besides the mandatory components (vCenter, SSO, Inventory Service, Web Client and ESXi), to download the vSphere Replication appliance from VMware.com.

The general task of the vSphere Replication appliance is to get data (VM files and changes) from the vSphere Replication agent of a protected ESXi host and transfer it to the configured recovery ESXi host (via a mechanism called Network File Copy – NFC).

Now it might get a little bit confusing. The appliance we are downloading is in fact 2 different appliances with 2 different OVF files pointing to the same disk file.

  1. 1x OVF (vSphere_Replication_OVF10.ovf), which is the virtual machine descriptor for the vSphere Replication (manager) appliance – used for SRM OR standalone vSphere Replication – 2 or 4 vCPU / 4 GB RAM
  2. 1x OVF (vSphere_Replication_AddOn_OVF10.ovf), which is the virtual machine descriptor for the vSphere Replication server – can be used to balance the load and increase the maximum number of replicated VMs – 2 vCPU / 512 MB RAM

vSphere replication (manager) appliance

The vSphere Replication (manager) appliance is the 'brain' of the vSphere Replication process and is registered with the vCenter so that the vSphere Web Client is aware of the new functionality. It stores the configuration data in the embedded PostgreSQL database or in an externally added SQL database. The VMware documentation typically talks about the vSphere Replication appliance; to make sure not to mix it up with the replication server I put the '(manager)' within the terminology. The vSphere Replication (manager) appliance also includes the 1st vSphere Replication server. Only 1 vSphere Replication appliance can be registered with a vCenter and it theoretically supports up to 2000 replications if we have 10 vSphere Replication servers in total. Please be aware of the following KB if you want to replicate more than 500 VMs, since minor changes to the appliance are mandatory.

vSphere replication server

The vSphere Replication server is responsible for the replication job itself (data gathering from the source ESXi and data transfer to the target ESXi). It is included within the vSphere Replication appliance and can effectively handle 200 replication jobs. Even though I have read in some blogs that it is only possible to spread the replication load over several vSphere Replication servers in conjunction with Site Recovery Manager, it works out of the box without the need for Site Recovery Manager.

Sample Architecture

The following picture should illustrate the components and traffic flow during the vSphere replication process.

vsphere_replication_overview

 

The following diagrams show two sample architectures regarding the network.

In the first diagram the vSphere Replication network MUST be routed/switched on layer 3, while in the second example we are able to stay in a single network segment with our replication traffic (thanks to the new VMkernel functionalities for NFC/replication traffic in vSphere 6).

Option 1: External routing/switching mandatory (would be a good use case for the internal routing of NSX ;-)):

vsphere_replication_design1

Option 2: No routing mandatory & switching occurs within the ESXi

vsphere_replication_design2

Of course those are only two simple configuration samples, but I want to make you aware that the (virtual) network design has an impact on the replication performance in the end.

I will focus on the performance difference between those two options (and the usage of the compression mode within a LAN) in part 2. Stay tuned ;-)

[#Troubleshooting] the operation is not allowed in the current state after replicated storage failover

I received a call with a typical error message within the vSphere world: When powering on VMs we received a warning with the following message

‘the operation is not allowed in the current state’

Scenario summary: vCenter/ESXi 5.5U3

  1. Storage LUNs were replicated to a second device (async)
  2. Failover to second storage device was triggered
  3. Datastores were made visible to the ESXi and the VMFS was resignatured
  4. VMs were registered to the ESXi hosts

Symptoms

When the recovered VMs are powered on, the mentioned error occurred.

Screen Shot 2015-03-27 at 17.22.15

A reboot of the ESXi, the vCenter and its services, and even an ESXi reconnect did not solve the problem, so I started a more deterministic root cause analysis.

Root cause:

The recovered virtual machines' CD drives were referring to an ISO file on a non-existent NFS datastore that hadn't been recovered. Unfortunately the error message itself was not pointing to the root cause.

Root cause analysis:

Checking the vCenter vpxd.log didn't give us much information about the problem:


vim.VirtualMachine.powerOn: vim.fault.InvalidHostConnectionState:
mem> --> Result:
mem> --> (vim.fault.InvalidHostConnectionState) {
mem> --> dynamicType = <unset>,
mem> --> faultCause = (vmodl.MethodFault) null,
mem> --> host = '',
mem> --> msg = "",
mem> --> }
mem> --> Args:

Hmm, yeah... not very much useful information. So next step -> checking the hostd.log within the ESXi host.

2015-03-27T12:03:36.340Z [69C40B70 info 'Solo.Vmomi' opID=hostd-6dc9 user=root] Throw vmodl.fault.RequestCanceled
2015-03-27T12:03:36.340Z [69C40B70 info 'Solo.Vmomi' opID=hostd-6dc9 user=root] Result:
--> (vmodl.fault.RequestCanceled) {
--> dynamicType = <unset>,
--> faultCause = (vmodl.MethodFault) null,
--> msg = "",
--> }
2015-03-27T12:03:36.341Z [FFBC6B70 error 'SoapAdapter.HTTPService.HttpConnection'] Failed to read header on stream <io_obj p:0x6ab82a48, h:66, <TCP '0.0.0.0:0'>, <TCP '0.0.0.0:0'>>: N7Vmacore15SystemExceptionE(Connection reset by peer)
2015-03-27T12:03:40.024Z [FFBC6B70 info 'Libs'] FILE: FileVMKGetMaxFileSize: Could not get max file size for path: /vmfs/volumes/XXXXXX, error: Inappropriate ioctl for device
2015-03-27T12:03:40.024Z [FFBC6B70 info 'Libs'] FILE: File_GetVMFSAttributes: Could not get volume attributes (ret = -1): Function not implemented
2015-03-27T12:03:40.024Z [FFBC6B70 info 'Libs'] FILE: FileVMKGetMaxOrSupportsFileSize: File_GetVMFSAttributes Failed

So it seems that we had some kind of I/O problem. Checking /vmfs/volumes/XXXX we realized that we were not able to access the device. The volume itself was an NFS share mounted as a datastore, and as you probably know NFS datastores are also mounted in the /vmfs/volumes/ folder of the ESXi host.

Even though the VMs were running on block-based storage (iSCSI), I found out that there was still a dependency between the VMs and the unreachable NFS device -> the VMs had an ISO file from an NFS datastore mounted. During the failover of the storage the NFS datastore hadn't been restored, and the VMs were trying to access the NFS share to attach the ISO file.

Summary:

Those things happen all the time, so take care to unmount devices when you don't need them anymore (use RVTools/scripts and establish an overall operating process -> check my ops-manual framework ;-) ). Those little things can be a real show-stopper in any kind of automated recovery procedure (scripted, vSphere Site Recovery Manager, etc.).
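A quick PowerCLI sketch (assuming a connected session) to find VMs that still have an ISO attached to their CD drive could look like this:

# Sketch: list every VM with an ISO file mounted in its virtual CD drive.
Get-VM | Get-CDDrive |
    Where-Object { $_.IsoPath } |
    Select-Object @{Name='VM'; Expression={$_.Parent.Name}}, IsoPath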

vSphere Web Client 6.0: working remote via tethering. Bandwidth comparison RDP vs. local Browser

When I first installed vSphere 6.0 I was pretty impressed by the performance gain of the vSphere Web Client. Finally the Web Client is a tool I can work with productively, without being afraid of being marked as unproductive by my customers (it's tough to argue for a higher hourly rate if I spend 20% of my time waiting for the UI ;-) ).

So my homelab was installed with vSphere 6.0 and I tried to connect to it via VPN from my hotel wifi. Since the wifi was blocking my VPN attempts I was forced to tether/share the internet via my smartphone.

Sharing internet on... starting OpenVPN against my homelab... opened Chrome to use the Web Client and... the usability of the Web Client 6.0 was really, really good.

After a few minutes I received a warning from my provider T-Mobile that my data plan had reached the 80% threshold. I know the 500 MB included in my data plan is not that much, but still I was really surprised to see these OpenVPN statistics after a few minutes.

Screen Shot 2015-03-16 at 21.11.54

 

Since I hadn't used any other services than the vSphere Web Client, I wanted to know how much bandwidth working with the Web Client in a local browser really needs.

I created a test case (it's Sunday, the weather is bad, the Bundesliga is pausing) which should take around 3-4 minutes:

  1. Login to the vCenter via the Web Client
  2. Navigate around in the home menu
  3. Select Hosts and Clusters
  4. Choose an ESXi Host and navigate to Manage / Settings / Networking
  5. View the vSwitch, virtual adapter and physical NIC settings
  6. Go to related Objects and select a datastore
  7. Browse the content of the datastore
  8. Create a new VM with default settings
  9. Power on the VM

I did this test with two browsers, Chrome and Firefox (to make sure the results are nearly identical), and observed the results via the Activity Monitor of macOS. As a third alternative I chose a remote connection via Microsoft Remote Desktop (native resolution, 16-bit color) and did the same test-case steps mentioned above.

Here are the results:

  1. Chrome – duration: < 4 minutes, bandwidth: ca. 21 MB
  2. Firefox – duration: < 4 minutes, bandwidth: ca. 26 MB
  3. RDP – duration: < 3.5 minutes, bandwidth: ca. 2 MB

Of course there are a lot of factors not considered (high activity on the vCenter would most certainly increase the data volume), but the numbers should give you a feeling that the better performance of the Web Client seems to come side by side with pretty bandwidth-hungry caching on the client side. So if you work with limited bandwidth or any kind of throughput limitation, use an RDP connection to a control-center VM within your environment that can run the Web Client for your daily vSphere operations.

Appendix:

Screen Shot 2015-03-29 at 15.27.18 Screen Shot 2015-03-29 at 15.34.47 Screen Shot 2015-03-29 at 15.38.05
