Category: VMware

Deleting vCloud Director Temporary Data

Recently, I learned a bit about vCloud Director’s internal database and table structure, which I am going to share with you here.

vCD holds two types of tables worth pointing out: the QRTZ and INV tables. The latter store information about the vCenter Server inventory (INV) and are kept up-to-date as the inventory changes. This could be due to changes vCD makes itself or those carried out by an administrator. When the vCD service starts up, it connects to vCenter to read in its inventory and update its INV tables. The QRTZ tables are used for heartbeating and synchronization between cells (at least from what I understood, no guarantees).

Why am I telling you this? Both types of tables can be cleared without losing any vital data. You can make use of this knowledge whenever you feel your vCloud DB is out of sync with the vCenter inventory. This happens, for example, when you have to restore your vCenter database from a backup without restoring vCloud’s database.

WARNING: Following this procedure is completely unsupported by VMware GSS. Follow the instructions below at your own risk or when you have no support for your installation anyway 😉

  1. Shut down all vCloud Director cells
  2. Clear the QRTZ and INV tables using “delete from <table name>”
  3. Start one of the cells and watch it start (/opt/vmware/vcloud-director/log/cell.log)
  4. Start the other cells

Here are some SQL statements that will automate step 2 for you:

delete from QRTZ_SCHEDULER_STATE;
delete from QRTZ_FIRED_TRIGGERS;
delete from QRTZ_PAUSED_TRIGGER_GRPS;
delete from QRTZ_CALENDARS;
delete from QRTZ_TRIGGER_LISTENERS;
delete from QRTZ_BLOB_TRIGGERS;
delete from QRTZ_CRON_TRIGGERS;
delete from QRTZ_SIMPLE_TRIGGERS;
delete from QRTZ_TRIGGERS;
delete from QRTZ_JOB_LISTENERS;
delete from QRTZ_JOB_DETAILS;

delete from compute_resource_inv;
delete from custom_field_manager_inv;
delete from cluster_compute_resource_inv;
delete from datacenter_inv;
delete from datacenter_network_inv;
delete from datastore_inv;
delete from datastore_profile_inv;
delete from dv_portgroup_inv;
delete from dv_switch_inv;
delete from folder_inv;
delete from managed_server_inv;
delete from managed_server_datastore_inv;
delete from managed_server_network_inv;
delete from network_inv;
delete from resource_pool_inv;
delete from storage_pod_inv;
delete from storage_profile_inv;
delete from task_inv;
delete from vm_inv;
delete from property_map;
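If you would rather generate the statements than copy them by hand, here is a minimal Python sketch. It only builds the SQL text from the table names listed above and does not connect to any database. The helper name build_delete_statements is my own invention, not part of vCD.

```python
# Table names copied from the statements above. This script only prints SQL;
# it never connects to the vCloud Director database.
QRTZ_TABLES = [
    "QRTZ_SCHEDULER_STATE", "QRTZ_FIRED_TRIGGERS", "QRTZ_PAUSED_TRIGGER_GRPS",
    "QRTZ_CALENDARS", "QRTZ_TRIGGER_LISTENERS", "QRTZ_BLOB_TRIGGERS",
    "QRTZ_CRON_TRIGGERS", "QRTZ_SIMPLE_TRIGGERS", "QRTZ_TRIGGERS",
    "QRTZ_JOB_LISTENERS", "QRTZ_JOB_DETAILS",
]

# property_map is listed together with the INV tables in the post.
INV_TABLES = [
    "compute_resource_inv", "custom_field_manager_inv",
    "cluster_compute_resource_inv", "datacenter_inv", "datacenter_network_inv",
    "datastore_inv", "datastore_profile_inv", "dv_portgroup_inv",
    "dv_switch_inv", "folder_inv", "managed_server_inv",
    "managed_server_datastore_inv", "managed_server_network_inv",
    "network_inv", "resource_pool_inv", "storage_pod_inv",
    "storage_profile_inv", "task_inv", "vm_inv", "property_map",
]


def build_delete_statements(tables):
    """Return one 'delete from <table>;' statement per table, keeping the order."""
    return [f"delete from {table};" for table in tables]


if __name__ == "__main__":
    for statement in build_delete_statements(QRTZ_TABLES + INV_TABLES):
        print(statement)
```

Keeping the order of the list matters if the tables reference each other, which is why the helper does not sort or deduplicate anything.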

 

Looking Back at a Day of vShield Manager Troubleshooting

Dear diary,

this entire day was f***** up by a VMware product called vShield Manager …

This, or something similar, is what today’s entry in my non-existent diary would look like. It was one of these typical “piece of cake” tasks that turn into nightmares 😀 Literally, the task read “Configure VXLAN for the ***** cluster” – easy, hmm!?

1. Ok, let’s go: The physical switch configuration turned out easy as it was already done for me 🙂 CHECK.

2. So, naive me, I connected to vShield Manager UI, went Datacenter -> Network Virtualization -> Prepare and added the cluster, gave it the name of the also already existing Distributed Switch and the VLAN ID and let it run. FAIL: “not ready”.

vsm

VSM itself doesn’t give a lot of details but I knew that probably the deployment and installation of the VXLAN VIB package failed. Looking at esxupdate.log I could see a “temporary failure in DNS lookup” (exact wording was probably different). Looking at the ESXi hosts’ DNS configuration: empty. Cool! Fix -> CHECK. Later I found out that I myself blogged about this a while ago 😀

3. Now let’s try again, but first we have to “unprepare” the cluster: removed the check in VSM: Error. Of course. VSM didn’t create the Port Group nor the VMkernel ports and now tries to remove them … computer logic 😀 At this point, simply hit “Refresh” and it will be gone. Now we can try the preparation process once more: Error:

domain-c3943 already has been configured with a mapping.

Grrrr … luckily, I found this: http://blog.jgriffiths.org/?p=482 To be honest, the sentence “VMware support was able to help… and I suggest unless you don’t care about your cluster or vShield implementation that you call them to solve it” scared me a bit, BUT no balls, no gain (wait, is that right?). WORKS! PARTYYY! But once again: preparation failed (devil)

4. I can’t quite remember which error message or log entry helped me find VMware KB 2053782. Following the steps sounds simple, but hey, why should anything work today?! 😀 Check my other blog post about this particular one. After applying the – I like to call it – “curl”-hack to VSM (see the step before) again, I prepared the cluster one more time and finally the VXLAN VIB could be deployed, BUT …

5. … The Port Group was not created … f*** this sh***. After 30ish minutes of blind flight through VSM and vCD, I figured out that other clusters could not deploy VXLANs anymore. Due to this insight and a good amount of despair, I just rebooted VSM. Then unprepare, “curl”-hack, prepare … and: WORKS!

vxlan

Portgroup is there. BUT:

6. No VMkernel Ports were created (I ran out of curses by that time). Another 30min passed until I unprepared, “curl”-hacked and prepared the cluster one last time before the VMkernel Ports were then magically created. THANK GOD! So I went ahead creating a Network Scope for the cluster.

I tested creating VXLAN networks through VSM a couple of times and it seemed to properly create additional Port Groups. You think the day was over yet? WROOONG!

7. Next, I tried through vCloud Director. The weird thing was that a Network Pool for that cluster already existed with a different name than the Network Scope I had just created. It had to be some relic from before my time in the project. Trying to deploy a vApp, I ran into something I am going to write about tomorrow. Once this was fixed, I kept receiving this:

[Screenshot of the vCloud Director error message]

Judging from the error message, vCloud Director tries to allocate a network from a pool for which VSM has no network scope defined. These things did not work:
– Click “Repair” on the Network Pool
– Create a Network Scope with the same name as the Network Pool (vCD apparently uses some kind of ID instead of the name of the Network Scope).

The only possible solutions I could come up with are deleting and re-creating the Provider vDC or going into the vCD database and doing some magic there. The only information on this I could find was in the VMware Communities: https://communities.vmware.com/thread/448106. So I am going to open a ticket.

Good night.

VMware vSphere Update Manager causes VXLAN Agent to fail on install and uninstall

The title of this article is that of VMware KB article 2053782. Following the steps seems simple but turned out to provide a couple of pitfalls and inaccuracies:

Mean pitfall:

Open a browser to the MOB at: https://vCenter_Server_IP/eam/mob/

When I opened this URL to my vCenter server, I received the following error message:

The vSphere ESX Agent Manager (vEAM) failed to complete the command.

The thing to point out here is the slash / after “mob” in the URL! Without this slash it won’t work.
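If you script the MOB access instead of typing the URL, a tiny helper avoids the missing-slash trap. This is a minimal Python sketch. The function names and the host name vc01.example.com are placeholders I made up, and only the /eam/mob/ path comes from the KB article.

```python
def ensure_trailing_slash(url):
    """Append a '/' to the URL unless it already ends with one."""
    return url if url.endswith("/") else url + "/"


def eam_mob_url(vcenter_host):
    """Build the EAM MOB URL for a vCenter host, including the trailing slash."""
    return ensure_trailing_slash(f"https://{vcenter_host}/eam/mob")


if __name__ == "__main__":
    # vc01.example.com is a made-up host name for illustration only.
    print(eam_mob_url("vc01.example.com"))
```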

Unclear Instructions:

b. In the <config> field, change the value from true to false:
<config>
<bypassVumEnabled>false</bypassVumEnabled>
</config>

Reading this, I understood the instructions as: leave the XML data as is but turn “true” into “false” for the bypassVumEnabled element. In the code example they gave, they removed all the other elements, but I thought that was probably to save space in the KB article. WRONG! It turned out you have to:

  1. Delete all the XML data from the text field
  2. Paste the code above (config element with nested bypassVumEnabled element) – nothing else!
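To make clear how little has to end up in that text field, here is a minimal Python sketch that builds exactly the config element with its single nested bypassVumEnabled element. The function name minimal_bypass_config is my own. The sketch only produces the XML text and does not talk to the MOB.

```python
import xml.etree.ElementTree as ET


def minimal_bypass_config(enabled=False):
    """Return the <config> element with only <bypassVumEnabled> inside, as text."""
    config = ET.Element("config")
    flag = ET.SubElement(config, "bypassVumEnabled")
    flag.text = "true" if enabled else "false"
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(config, encoding="unicode")


if __name__ == "__main__":
    print(minimal_bypass_config(False))
```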

 

Hope that helps 🙂

 

Nested ESXi with OpenStack

For all of you who want to run VMware’s ESXi 5.x on an OpenStack cloud running vSphere as the hypervisor, I have a tiny little tip that might save you some researching: The difficulty I faced was “How do I enable nesting (vHV) for an OpenStack deployed instance?”. I was almost going to write a script to add

featMask.vm.hv.capable="Min:1"
vhv.enable="True"

and run it after the “nova boot” command, and then I found what I am going to show you now.

Remember that when uploading an image into Glance you can specify key/value pairs called properties? Well, you are probably already aware of this:

root@controller:~# glance image-show 9eb827d3-7657-4bd5-a6fa-61de7d12f649
+-------------------------------+--------------------------------------+
| Property                      | Value                                |
+-------------------------------+--------------------------------------+
| Property 'vmware_adaptertype' | ide                                  |
| Property 'vmware_disktype'    | sparse                               |
| Property 'vmware_ostype'      | windows7Server64Guest                |
| checksum                      | ced321a1d2aadea42abfa8a7b944a0ef     |
| container_format              | bare                                 |
| created_at                    | 2014-01-15T22:35:14                  |
| deleted                       | False                                |
| disk_format                   | vmdk                                 |
| id                            | 9eb827d3-7657-4bd5-a6fa-61de7d12f649 |
| is_public                     | True                                 |
| min_disk                      | 0                                    |
| min_ram                       | 0                                    |
| name                          | Windows 2012 R2 Std                  |
| protected                     | False                                |
| size                          | 10493231104                          |
| status                        | active                               |
| updated_at                    | 2014-01-15T22:37:42                  |
+-------------------------------+--------------------------------------+
root@controller:~#

At this point, take a look at the vmware_ostype property, which is set to “windows7Server64Guest”. This value is passed to the vSphere API when deploying an image through ESXi’s API (VMwareESXDriver) or the vCenter API (VMwareVCDriver). Looking at the vSphere API/SDK API Reference, you can find the valid values; since vSphere 5.0 the list includes “vmkernelGuest” and “vmkernel5Guest”, representing ESXi 4.x and 5.x respectively. According to my testing, this works with Nova’s VMwareESXDriver as well as VMwareVCDriver.

This is how you change the property in case you set it differently:

# glance image-update --property "vmware_ostype=vmkernel5Guest" IMAGE

And to complete the picture, this is the code in Nova that implements this functionality:

def get_vm_create_spec(client_factory, instance, name, data_store_name,
                       vif_infos, os_type="otherGuest"):
    """Builds the VM Create spec."""
    config_spec = client_factory.create('ns0:VirtualMachineConfigSpec')
    config_spec.name = name
    config_spec.guestId = os_type
    # The name is the unique identifier for the VM. This will either be the
    # instance UUID or the instance UUID with suffix '-rescue' for VM's that
    # are in rescue mode
    config_spec.instanceUuid = name

    # Allow nested ESX instances to host 64 bit VMs.
    if os_type == "vmkernel5Guest":
        config_spec.nestedHVEnabled = "True"

You can see that vHV is only enabled if the os_type is set to vmkernel5Guest. I would assume that like this you cannot nest Hyper-V or KVM but I haven’t validated.
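To play with that condition outside of Nova, here is a stripped-down Python sketch of the same gate. It is not Nova code: apply_guest_settings is a name I made up, and a plain SimpleNamespace stands in for the real VirtualMachineConfigSpec object.

```python
from types import SimpleNamespace


def apply_guest_settings(config_spec, os_type="otherGuest"):
    """Mimic the excerpt: set guestId, enable nested HV only for vmkernel5Guest."""
    config_spec.guestId = os_type
    if os_type == "vmkernel5Guest":
        config_spec.nestedHVEnabled = "True"
    return config_spec


if __name__ == "__main__":
    for os_type in ("vmkernel5Guest", "windows7Server64Guest"):
        spec = apply_guest_settings(SimpleNamespace(), os_type)
        print(os_type, hasattr(spec, "nestedHVEnabled"))
```

Any other guest ID, including the Windows one from the image above, leaves the attribute unset.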

Pretty good already. But what I am really looking for is running ESXi on top of KVM as I need nested ESXi combined with Neutron to create properly isolated tenant networks. The most current progress with this can probably be found in the VMware Community.

Fixing Failed OVF Deployment Due to Integrity Error

We were just about to deploy the latest version of VCE’s UIMP when vSphere Client failed with an error message about a failed integrity check of the VMDK file. Although not the best idea, the fix was easy: whenever an integrity check is performed, the OVF package ships with an additional file, the manifest file (.mf). In our case, it contained SHA1 checksum information about the .ovf and the .vmdk file:

mathias@x1c:/media/mathias/Volume$ ls -hl UIMP*
-rw------- 1 mathias mathias 2,6G Mär  5 19:25 UIMP-4.0.0.2.359-Install-Media.iso
-rw------- 1 mathias mathias 1,3G Mär  6 09:28 UIMP_OVF10-disk1.vmdk
-rw------- 1 mathias mathias  133 Mär  6 10:54 UIMP_OVF10.mf
-rw------- 1 mathias mathias 125K Jan  4 23:13 UIMP_OVF10.ovf
mathias@x1c:/media/mathias/Volume$ cat UIMP_OVF10.mf
SHA1(UIMP_OVF10.ovf)= 881533ff36aebc901555dfa2c1d52a6bd4c47d99
SHA1(UIMP_OVF10-disk1.vmdk)= f175f150decb2bf5a859903b050f4ea4a3982023

Interestingly, the whole OVF data was contained in an ISO file which passed the MD5 check after download. So something must have gone wrong packaging the appliance. I calculated the SHA1 checksums for both files and compared them to the ones in the manifest:

mathias@x1c:/media/mathias/Volume$ sha1sum UIMP_OVF10.ovf
881533ff36aebc901555dfa2c1d52a6bd4c47d99  UIMP_OVF10.ovf
mathias@x1c:/media/mathias/Volume$ sha1sum UIMP_OVF10-disk1.vmdk
f175f150decb2bf5a859903b050f4ea4a3982023  UIMP_OVF10-disk1.vmdk
mathias@x1c:/media/mathias/Volume$

VMDK: f175f150decb2bf5a859903b050f4ea4a3982023
MF: 9371305140e8c541b0cea6b368ad4af47221998e

Strange 😀 Well, the fix was to simply edit the manifest with the correct SHA1 checksum. But please keep in mind that just because we forged the checksum doesn’t mean the data is not corrupt. In this case, as the MD5 check of the ISO was successful, we assumed that the data was probably fine and decided to give it a try. In the end, the file was still corrupt and brought errors mounting the root file system.
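The manual comparison can also be scripted. The following Python sketch assumes the manifest uses the “SHA1(file)= digest” line format shown above and that the referenced files sit next to the .mf file. The function names are my own.

```python
import hashlib
import re
from pathlib import Path

# Matches manifest lines such as: SHA1(UIMP_OVF10.ovf)= 8815...7d99
MANIFEST_LINE = re.compile(
    r"^SHA1\((?P<name>.+)\)\s*=\s*(?P<digest>[0-9a-fA-F]{40})\s*$"
)


def sha1_of(path, chunk_size=1024 * 1024):
    """Compute the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest_path):
    """Return {file name: True/False} for every SHA1 entry in the manifest."""
    manifest_path = Path(manifest_path)
    results = {}
    for line in manifest_path.read_text().splitlines():
        match = MANIFEST_LINE.match(line.strip())
        if not match:
            continue  # skip blank lines and entries in other formats
        target = manifest_path.parent / match.group("name")
        expected = match.group("digest").lower()
        results[match.group("name")] = sha1_of(target) == expected
    return results


if __name__ == "__main__":
    import sys
    for name, ok in check_manifest(sys.argv[1]).items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

A mismatch only tells you that file and manifest disagree. As described above, it says nothing about which of the two is actually wrong.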

Do not run TSM as a vCD Workload!

I just got called in for a Tivoli Backup troubleshooting. The symptoms seen were extremely strange:

The TSM proxy successfully connected to vCenter and a backup job could be started. In vSphere Client we could see that the VM to be backed up was snapshotted. The next step would be to attach the VMDK to the TSM virtual machine but instead it was attached to an entirely different VM 😀 Of course, the backup job failed.

Looking at the TSM VM, I found out it was part of a vApp deployed through vCenter Orchestrator and vCloud Director. I figured it was probably a bad idea to run the TSM proxy in a vCD vApp for several reasons:

1. TSM is going to back up vCloud Director VMs and running that same backup server as a vCD VM itself seemed strange. Any scripts or similar to backup the entire vCD vApp workload would probably try to back up the TSM proxy, too.

2. TSM talks to vCenter requesting the creation of snapshots and attachment of VMDKs to itself. As a vCD VM, it is marked as controlled by vCD and any changes through vSphere Client are not recommended. But exactly this would happen when a VMDK gets attached to TSM for backup.

So the first try was to clone the vCloud VM into an ordinary vCenter VM and shut the vApp down. Booom, works! We resolved the issue quickly but unfortunately, the actual technical cause for this is still unknown to us. So in case one of you knows what exactly was going on, please drop me a mail 🙂

cheers
Mathias

Video Recommendation: Nicira NVP vs VMware NSX

Please take a look at the following questions:

  • What is NSX?
  • What the heck is the difference from Nicira NVP, or are they the same?
  • What are the technologies behind NSX and how does it work?

Is there any you cannot answer yet? If so, I would like to direct your attention to two great videos on YouTube which will clarify things:

OpenStack Networking – Theory Session, Part 1

OpenStack Networking – Theory Session, Part 2

Watching this will be the best 1h 45min you have invested in a while!

Have fun!

vCloud Director: Low Performance Powering On A vApp

I have been working on a project including vCloud Director as well as most other parts of VMware’s cloud stack for a while now. Until a couple of days ago, everything was running fine regarding the deployment process of vApps from vCloud Director UI or through vCenter Orchestrator. Now we noticed that starting and stopping vApps takes way too long: Powering on a single VM vApp directly connected to an external network takes three steps in vCenter:

  1. Reconfigure virtual machine
  2. Reconfigure virtual machine (again)
  3. Power On virtual machine

The first step of reconfigure virtual machine showed up in vCenter right after we triggered the vApp power on in vCloud Director. From then it took around 5min to reach step two. Once step 2 was completed, the stack paused for another 10min before the VM was actually powered on. This even seemed to have implications on vCenter Orchestrator including timeouts and failed workflows.

We spent an entire day trying to track the problem down and came to the conclusion that it had to be inside vCloud Director. But before we went into log files, message queues etc., we decided to simply reboot the entire stack: BINGO! After the reboot the problem vanished.

Shutdown Process:

  1. vCO
  2. vCD
  3. vCD NFS
  4. VSM
  5. vCenter
  6. SSO
  7. DB

Then boot the stack in reverse order and watch vCloud Director powering on VMs within seconds 😉
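If you keep the order in a runbook script, deriving the boot sequence from the shutdown list avoids maintaining two lists. This is a minimal Python sketch. The component names are the ones from the list above and boot_order is a helper name I made up.

```python
# Shutdown order as listed in the post, first component to stop comes first.
SHUTDOWN_ORDER = ["vCO", "vCD", "vCD NFS", "VSM", "vCenter", "SSO", "DB"]


def boot_order(shutdown_order):
    """Return the components in reverse shutdown order, i.e. the boot order."""
    return list(reversed(shutdown_order))


if __name__ == "__main__":
    for step, component in enumerate(boot_order(SHUTDOWN_ORDER), start=1):
        print(f"{step}. {component}")
```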

VMware Single Sign On and Active Directory Logon Restrictions – LDAP Error Code 49

Have you ever had issues using an Active Directory user with a logon restriction in your vSphere 5.1 environment? I think so, otherwise you probably wouldn’t have landed here 😉

Since I haven’t found any explanation of how to deal with this situation, I have tried to summarize it here.

We all know a good role concept in vSphere is a very important thing. That’s why we typically create our own roles and attach (service/personal) Active Directory users/groups to those specific roles.

vcenter_ad_restriction01

So far, this is nothing new for us and not very problematic at all if SSO is properly configured to use Active Directory as an identity source.

vcenter_ad_restriction02

Once we are forced to deal with strong security policies, we might run into issues when our vCenter user is defined with a logon restriction.

vcenter_ad_restriction03

From a security perspective it definitely makes sense to restrict the access for the Users (e.g. service-accounts) to specific Computers. From a logical point of view, having access to the vCenter (vc01 in the picture) should be enough. Unfortunately things in the IT world are not always that logical, which leads us to the fact that we can’t login to our vCenter after having configured a restriction.

vcenter_ad_restriction04

Credentials are not valid? Sounds to me like a typo or wrong password, but a closer look at the imsTrace.log (located at %ProgramFiles%/VMware/Infrastructure/SSOServer/logs) gave me another hint.

vcenter_ad_restriction05

‘javax.naming.AuthenticationException: [LDAP: error code 49 – 80090308: LdapErr: DSID-0C0903A9, comment: AcceptSecurityContext error, data 531, v1db1 ];’

A short online search for LDAP error codes told me what error code 49 and data 531 mean exactly.

vcenter_ad_restriction06

531 means: not permitted to logon at this workstation, even though we are permitted to log on against the vCenter Server.
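To save yourself the online search next time, the sub-code can be pulled out of the log line and looked up automatically. The Python sketch below assumes the message format shown above. The lookup table contains the commonly documented Active Directory sub-codes for LDAP error 49, and explain_ldap_error is a name I made up.

```python
import re

# Commonly documented Active Directory "data" sub-codes for LDAP error 49.
AD_SUBCODES = {
    "525": "user not found",
    "52e": "invalid credentials",
    "530": "not permitted to log on at this time",
    "531": "not permitted to log on at this workstation",
    "532": "password expired",
    "533": "account disabled",
    "701": "account expired",
    "773": "user must reset password",
    "775": "account locked out",
}

# Captures the LDAP error code and the hexadecimal value after "data".
ERROR_PATTERN = re.compile(r"error code (\d+).*?data ([0-9a-fA-F]+)")


def explain_ldap_error(message):
    """Return (error code, sub-code, meaning), or None if nothing matches."""
    match = ERROR_PATTERN.search(message)
    if match is None:
        return None
    subcode = match.group(2).lower()
    return int(match.group(1)), subcode, AD_SUBCODES.get(subcode, "unknown sub-code")


if __name__ == "__main__":
    line = ("[LDAP: error code 49 - 80090308: LdapErr: DSID-0C0903A9, "
            "comment: AcceptSecurityContext error, data 531, v1db1 ]")
    print(explain_ldap_error(line))
```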

The only way to connect to vCenter with this user is to add the two configured LDAP servers used by Single Sign-On to the user’s list of permitted workstations as well.

vcenter_ad_restriction06

vcenter_ad_restriction07

Once this is done, we are able to logon with this user at our vCenter.

vcenter_ad_restriction08

Nevertheless, for many organisations it is not acceptable to grant such users access to the Domain Controllers themselves. Unfortunately, I haven’t found a satisfying workaround for this scenario so far. I’m not sure what the exact problem is, but it seems that the SSO service is just brokering the user credentials to one of the configured LDAP servers. If the user is not allowed to log on at this machine….beduumm…*failsound*…

The only way to deal with it is to run a mixed environment of vCenter 5.1 and 5.5 components. The new SSO component in vSphere 5.5 is much more robust and works better than the old one, and according to VMware it is supported to use the SSO from 5.5 with vCenter 5.1.

I’m not a fan of such a mixture, but sometimes we have no other choice but to do it that way. Make sure to update SSO and update the vSphere Web Client to the newest version of 5.5.

Once we have done the upgrade, we need to configure the new active directory identity source with the integrated windows authentication. In the first step, remove the old AD configuration setting. In the second step we add a new identity source and choose Active Directory (Integrated Windows Authentication).

vcenter_ad_restriction09

And voila, we are able to logon to the vCenter with a user log on restricted only to the vCenter machine itself. That’s what we typically want to have.

As a summary, these are the options we currently have for using logon-restricted Active Directory users with vSphere:

  1. Use no restrictions for your vCenter user.
  2. Run a mixed vCenter infrastructure with vCenter 5.1 and 5.5 components.
  3. Upgrade directly to vCenter 5.5.

Since I struggled with this issue for several days, I hope this summary might save you some time during your vSphere implementation and the Active Directory integration.

Update Manager Error: Cannot create a ramdisk

We ran into a problem with Update Manager on vSphere 5.1 lately:

Cannot create a ramdisk of size 329MB to store
the upgrade image. Check if the host has
sufficient memory.

The help provided in vSphere documentation says

The ESXi host disk must have enough free space to store the contents of the installer DVD. The corresponding error code is SPACE_AVAIL_ISO.

Yeah, thanks 😀 That’s what the error message said in the first place 😀 But it might as well describe your exact problem, so please check http://virtualdiscussion.wordpress.com/2012/12/13/uprade-of-esxi-host-fails-using-update-manager-fails-vum/ for a possible resolution.

Well, that didn’t do it for us. We had enough space … Hmmm … after a while of struggling, it turned out to be lockdown mode which caused the error! After disabling it (via DCUI, as vCenter failed with a different error message) VUM remediated the host like a charm 🙂 WTF!?

So far, it hasn’t reproduced on any other host, but let’s see what happens in the future 😉

 

© 2019 v(e)Xpertise
