Tag: Troubleshooting

'The operation is not allowed in the current state' after replicated storage failover

I received a call about a typical error message in the vSphere world: when powering on VMs, a warning with the following message appeared

‘the operation is not allowed in the current state’

Scenario summary: vCenter/ESXi 5.5U3

  1. Storage LUNs were replicated to a second device (async)
  2. Failover to second storage device was triggered
  3. Datastores were made visible to the ESXi and the VMFS was resignatured
  4. VMs were registered to the ESXi hosts
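
As a side note: step 4 (registering the recovered VMs) can be scripted. Below is a minimal pyVmomi sketch of how such a bulk registration could look; the vCenter address, credentials, inventory indices and .vmx paths are placeholders, not the actual environment.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask

# Sketch only: all names, paths and credentials below are placeholders.
ctx = ssl._create_unverified_context()  # lab use; validate certificates in production
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
content = si.RetrieveContent()
datacenter = content.rootFolder.childEntity[0]   # assumes the first datacenter
cluster = datacenter.hostFolder.childEntity[0]   # assumes the first cluster
pool = cluster.resourcePool
# .vmx paths of the recovered VMs on the resignatured datastore (hypothetical)
vmx_paths = [
    "[snap-1234-datastore01] vm01/vm01.vmx",
    "[snap-1234-datastore01] vm02/vm02.vmx",
]
for path in vmx_paths:
    # RegisterVM_Task adds an existing VM to the inventory without copying any files
    WaitForTask(datacenter.vmFolder.RegisterVM_Task(path=path, asTemplate=False, pool=pool))
    print("Registered", path)
Disconnect(si)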

Symptoms

When the recovered VMs were powered on, the error mentioned above occurred.


A reboot of the ESXi host, of vCenter and its services, and even an ESXi reconnect did not solve the problem, so I started a more deterministic root cause analysis.

Root cause:

The recovered virtual machines' CD drives were referring to an ISO file on a non-existent NFS datastore that had not been recovered. Unfortunately, the error message itself did not point to the root cause.

Root cause analysis:

Checking the vCenter vpxd.log didn't give us much information about the problem:


vim.VirtualMachine.powerOn: vim.fault.InvalidHostConnectionState:
--> Result:
--> (vim.fault.InvalidHostConnectionState) {
--> dynamicType = <unset>,
--> faultCause = (vmodl.MethodFault) null,
--> host = '',
--> msg = "",
--> }
--> Args:
Hmm, yeah… not much useful information. So next step -> checking the hostd.log on the ESXi host.
2015-03-27T12:03:36.340Z [69C40B70 info 'Solo.Vmomi' opID=hostd-6dc9 user=root] Throw vmodl.fault.RequestCanceled
2015-03-27T12:03:36.340Z [69C40B70 info 'Solo.Vmomi' opID=hostd-6dc9 user=root] Result:
--> (vmodl.fault.RequestCanceled) {
--> dynamicType = <unset>,
--> faultCause = (vmodl.MethodFault) null,
--> msg = "",
--> }
2015-03-27T12:03:36.341Z [FFBC6B70 error 'SoapAdapter.HTTPService.HttpConnection'] Failed to read header on stream <io_obj p:0x6ab82a48, h:66, <TCP '0.0.0.0:0'>, <TCP '0.0.0.0:0'>>: N7Vmacore15SystemExceptionE(Connection reset by peer)
2015-03-27T12:03:40.024Z [FFBC6B70 info 'Libs'] FILE: FileVMKGetMaxFileSize: Could not get max file size for path: /vmfs/volumes/XXXXXX, error: Inappropriate ioctl for device
2015-03-27T12:03:40.024Z [FFBC6B70 info 'Libs'] FILE: File_GetVMFSAttributes: Could not get volume attributes (ret = -1): Function not implemented
2015-03-27T12:03:40.024Z [FFBC6B70 info 'Libs'] FILE: FileVMKGetMaxOrSupportsFileSize: File_GetVMFSAttributes Failed

So it seemed we had some kind of I/O problem. Checking /vmfs/volumes/XXXX we realized that we were not able to access the device.
The volume itself was an NFS share mounted as a datastore and, as you probably know, NFS datastores are also mounted under the /vmfs/ folder on the ESXi host.
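
If you want to check this for the whole environment instead of poking around in /vmfs/volumes on a single host, a small pyVmomi sketch like the following lists every datastore together with its accessibility flag (connection details are placeholders):

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use; validate certificates in production
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.Datastore], True)
for ds in view.view:
    s = ds.summary
    # 'accessible' drops to False when the backing device (e.g. an NFS export) is gone
    print("{:<30} type={:<5} accessible={}".format(s.name, s.type, s.accessible))
Disconnect(si)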

Even though the VMs were running on block-based storage (iSCSI), I found out that there was still a dependency between the VMs and the unreachable NFS device -> the VMs had an ISO file from an NFS datastore mounted. During the storage failover the NFS datastore had not been restored, and the VMs were trying to access the NFS share to attach the ISO file.

Summary:

Those things happen all the time, so take care to unmount devices when you don't need them anymore. Use RVTools or a small script like the sketch below, and establish an overall operating process -> check my ops-manual framework 😉 . Those little things can be a real show-stopper in any kind of automated recovery procedure (scripted, vSphere Site Recovery Manager, etc.).
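
A minimal pyVmomi sketch of such a check, reporting every VM whose CD drive is backed by an ISO file on a datastore (connection details are placeholders):

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use; validate certificates in production
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
for vm in view.view:
    if vm.config is None:   # skip inaccessible/orphaned VMs
        continue
    for dev in vm.config.hardware.device:
        if isinstance(dev, vim.vm.device.VirtualCdrom) and \
           isinstance(dev.backing, vim.vm.device.VirtualCdrom.IsoBackingInfo):
            # backing.fileName looks like "[nfs-datastore] isos/something.iso"
            print("{}: CD drive mapped to {}".format(vm.name, dev.backing.fileName))
Disconnect(si)

Anything that shows up here with an ISO on a datastore you do not replicate is a candidate to break your recovery procedure.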

VMware Update to vSphere 5.5 and Horizon View 6.0 – vCenter service not working properly

A few days ago I received a mail from a former student of mine. They had updated their VMware environment to the latest vSphere 5.5U2 and afterwards Horizon View from 5.2 to 6.0.

From a procedural point of view it seemed that everything had worked fine. But on a second look he realized that in the Horizon View Manager dashboard the vCenter was marked red ('service is not working properly') and pool operations were not working anymore.

vCenter service not working properly

From a systematic troubleshooting perspective I recommended that he first check the connectivity between the Connection Server and the vCenter Server. OSI layers 1-4 were working well (the ports had not changed between the VMware versions either). For the layers above 4, I told him to check the classic access logs for authentication problems:

%ProgramData%\VMware\VirtualCenter\vpxd.log
%ProgramData%\VMware\CIS\IMStrace.log

%ProgramData%\VMware\VDM\logs\*.log

and to verify that the service account still had proper vCenter access and the correct permissions set within a role.
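
A quick way to verify the latter could be a pyVmomi sketch like this, which lists all permissions assigned to the service account together with the role behind them (the vCenter address, credentials and account name are placeholders):

import ssl
from pyVim.connect import SmartConnect, Disconnect

ctx = ssl._create_unverified_context()  # lab use; validate certificates in production
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
authz = si.RetrieveContent().authorizationManager
roles = {r.roleId: r.name for r in authz.roleList}
service_account = "DOMAIN\\svc-view"   # hypothetical View service account
found = False
for perm in authz.RetrieveAllPermissions():
    if perm.principal.lower() == service_account.lower():
        found = True
        entity = perm.entity.name if perm.entity else "(global)"
        print("{} -> role '{}' on {} (propagate={})".format(
            perm.principal, roles.get(perm.roleId), entity, perm.propagate))
if not found:
    print("No vCenter permission found for", service_account)
Disconnect(si)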

And voilà -> the service user's vCenter permission had been removed during the upgrade (all other permissions were still in place). Maybe a malfunction during the SSO / AD LDS upgrade. Unfortunately I am not able to have a closer look to do a root cause analysis of it.

Anyway! If you observe similar issues -> a) use a systematic approach to verify system communication or b) check the vCenter permissions directly.

Looking Back at a Day of vShield Manager Troubleshooting

Dear diarrhea diary,

this entire day was f***** up by a VMware product called vShield Manager …

This, or something similar, is how today's entry in my non-existent diary would look. It was one of those typical "piece of cake" tasks that turn into nightmares 😀 Literally, the task read "Configure VXLAN for the ***** cluster" – easy, hmm!?

1. Ok, let's go: The physical switch configuration turned out to be easy as it was already done for me 🙂 CHECK.

2. So, naive me, I connected to the vShield Manager UI, went to Datacenter -> Network Virtualization -> Prepare, added the cluster, gave it the name of the already existing Distributed Switch and the VLAN ID, and let it run. FAIL: "not ready".


VSM itself doesn't give a lot of detail, but I knew that the deployment and installation of the VXLAN VIB package had probably failed. Looking at esxupdate.log I could see a "temporary failure in DNS lookup" (the exact wording was probably different). Looking at the ESXi hosts' DNS configuration: empty. Cool! Fix -> CHECK. Later I found out that I had blogged about this myself a while ago 😀
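
For the record, checking all hosts for an empty DNS server list up front would have saved me that detour. A pyVmomi sketch of such a check (connection details are placeholders):

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use; validate certificates in production
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
for host in view.view:
    dns = host.config.network.dnsConfig if host.config else None
    servers = list(dns.address) if dns else []
    if servers:
        print("{}: DNS servers {}".format(host.name, ", ".join(servers)))
    else:
        print("{}: no DNS servers configured!".format(host.name))
Disconnect(si)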

3. Now let's try again, but first we have to "unprepare" the cluster: removed the checkmark in VSM: Error. Of course. VSM hadn't created the Port Group or the VMkernel ports and now tried to remove them … computer logic 😀 At this point, simply hit "Refresh" and it will be gone. Now we can try the preparation process once more: Error:

domain-c3943 already has been configured with a mapping.

Grrrr … luckily, I found this: http://blog.jgriffiths.org/?p=482 To be honest, the sentence "VMware support was able to help… and I suggest unless you don't care about your cluster or vShield implementation that you call them to solve it" scared me a bit, BUT to balls to gain (wait, is that right?). WORKS! PARTYYY! But once again: preparation failed (devil)

4. I can't quite remember which error message or log entry helped me find VMware KB 2053782. Following the steps sounds simple, but hey, why should anything work today?! 😀 Check my other blog post about this particular one. After again applying the – I like to call it – "curl" hack to VSM (see the step before), I prepared the cluster one more time and finally the VXLAN VIB could be deployed, BUT …

5. … The Port Group was not created … f*** this sh***. After 30-ish minutes of blind flight through VSM and vCD, I figured out that other clusters could not deploy VXLANs anymore either. Due to this insight and a good amount of despair, I just rebooted VSM. Then unprepare, "curl" hack, prepare … and: WORKS!


The Port Group is there. BUT:

6. No VMkernel Ports were created (I had run out of curses by that time). Another 30 minutes passed until I unprepared, "curl"-hacked and prepared the cluster one last time, and the VMkernel Ports were then magically created. THANK GOD! So I went ahead and created a Network Scope for the cluster.

I tested creating VXLAN networks via VSM a couple of times and it seemed to properly create additional Port Groups. You think the day was over yet? WROOONG!

7. Next, I tried it through vCloud Director. The weird thing was that a Network Pool for that cluster already existed, with a different name than the Network Scope I had just created. It had to be some relic from before my time in the project. Trying to deploy a vApp I ran into something I am going to write about tomorrow. Once that was fixed, I kept receiving this:


Judging from the error message, vCloud Director tries to allocate a network from a pool for which VSM has no Network Scope defined. These things did not work out:
– Clicking "Repair" on the Network Pool
– Creating a Network Scope with the same name as the Network Pool, as vCD apparently uses some kind of ID instead of the name of the Network Scope

The only possible solutions I could come up with were deleting and re-creating the Provider vDC or going into the vCD database and doing some magic there. The only information on this I could find was in the VMware Communities: https://communities.vmware.com/thread/448106. So I am going to open a ticket.

Good night.
