BladeSystem Firmware Upgrade: It went wrong!

There was new firmware for my HP BladeSystem, so I chose this weekend to do the upgrade. Perhaps that was a mistake, but I was having connection issues to my Interconnects and reporting was not happening. I also thought it would help with my non-working small LCD screen; that was not the case either. So the saga begins innocuously enough, with a successful but odd upgrade of the OA, then moves on to the Virtual Connect devices, and was finally going to end with the Brocade switch upgrade, leaving the blade firmware to be upgraded next time.

The OA upgrade went fine, but it took one of my nodes offline, causing an HA event to occur. Only one of the nodes, though, not both of them.

The Flex-10 upgrade seemed to go fine, but the new firmware did not come into play until I rebooted the Flex-10 modules, and that is when everything went pear-shaped. I lost all connectivity to the entire virtual environment within the BladeSystem. It took a few hours to find the issue, but once I did, it was easy to solve. I needed to configure Enclosure Bay IP Addressing (EBIPA) with the proper IP addresses for the HP Flex-10 modules in order to log in to Virtual Connect Manager and complete the upgrade. Once that was done, everything worked as expected.
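For reference, EBIPA for the interconnect bays can be set from the OA command line as well as the web GUI. The sketch below is from memory and uses placeholder addresses and bay numbers, so verify the exact syntax against SHOW EBIPA and the OA CLI guide for your firmware before relying on it:

    SHOW EBIPA
    SET EBIPA INTERCONNECT 192.168.0.61 255.255.255.0 1
    SET EBIPA INTERCONNECT GATEWAY 192.168.0.1 1
    ENABLE EBIPA INTERCONNECT 1
    SAVE EBIPA

Once the interconnect bays have addresses, the Flex-10 modules become reachable and Virtual Connect Manager can be logged into to finish the firmware activation.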

However, I now have another issue, and one perhaps related to why the HA event occurred. Apparently I was experiencing a power condition event, which was keeping the blade from rebooting properly. A quick Google search pointed me to a site describing the same problem; they had to physically remove the blade from the enclosure and, in effect, re-seat it. Well, that worked.

Given all the issues with this upgrade, upgrading the firmware on the blades themselves and the Brocade switches will just have to wait for a few weeks. With EBIPA properly configured now, it should not be such a chore.

vSphere Upgrade: Migrating to Ephemeral dvSwitch Portgroups

There are two ways to solve the issues with dvSwitches I spoke about before. The first is to place vCenter onto an administrative per-host vSwitch. The second is to create new dvSwitch portgroups, but first ensure each portgroup is marked as ephemeral. But if you already have portgroups, how would you migrate from those portgroups to ephemeral ones? In theory, ephemeral ports do not require vCenter to be active in order for a port assignment to take place. So how do you migrate from static dvSwitch portgroups to ephemeral dvSwitch portgroups?

Figure 1: Making Ephemeral Portgroups

It sounds like a difficult process, but all in all it was not. All I did, as per Figure 1, was add a new portgroup with a different naming convention and make it an ephemeral port binding portgroup. Then all that was left to do was migrate the VMs from the old portgroup to the new portgroup.

In theory, this should be all I need to avoid the issues with the default static binding, such as failing to boot VMs when vCenter is no longer around rather than failing safe.

An alternative was to create an administrative regular vSwitch per node, which, as you can see in Figure 1, is the approach I originally took. Given that I like alternatives, this is definitely another valid method. In a recent reboot of my blade servers' OA, this approach seemed to work just fine for those VMs already switched over, but then again I do ensure vCenter and the VMs it requires boot first on a standard vSwitch. So I have not lost any functionality, but have gained some if vCenter is not running for any reason.


vSphere Upgrade: vCenter Crashed: Transaction log for database full

I came back from EMCworld 2011 and found that my vCenter server had crashed while I was away. This is a fairly uncommon issue, so how do you debug such things:

  1. Attempt to restart the vCenter server
  2. If that fails, look at the vCenter Server logs (vpxd.log)

Inside, this log showed me the following message near the end:

“Transaction log for database ???? is full. To find out why space in the log cannot be reused….”
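That error message itself points at the first diagnostic step: ask SQL Server why the log cannot be reused. A hedged sketch, assuming a Windows command prompt with rights on the database server (the server name here is a placeholder):

    sqlcmd -S VCENTER-SQL -E -Q "SELECT name, log_reuse_wait_desc FROM sys.databases"
    sqlcmd -S VCENTER-SQL -E -Q "DBCC SQLPERF(LOGSPACE)"

The log_reuse_wait_desc value tells you whether the log is waiting on a log backup, an open transaction, or something else, which in turn dictates whether a transaction log backup, a recovery model change, or a shrink is the right next step.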

vSphere Upgrade: Migrating vCenter/VUM Databases

The other day, yesterday to be exact, I completed a long-standing task of my vSphere upgrade: migrating my vCenter Server database from the vCenter Server itself to a dedicated MSSQL server. I have always wanted to do this, but licensing and other issues always got in the way. There were three reasons for this change:

  • I needed an MSSQL server for other tools such as HP Insight Dynamics
  • I have been installing Application Performance Management tools, and it would be very cool to see how vCenter behaves in a dynamic environment
  • I need to add more hosts and therefore more VMs, so I needed the room to grow

The steps are remarkably simple if your MSSQL servers are the same version, which mine are, as I went through the upgrade process from MSSQL 2000 at the beginning of this migration. These steps are:

  1. Use the Microsoft SQL Server 2008 Import and Export Data (64-bit) tool to migrate your data from one MSSQL server to another. Be sure you create new databases on the target MSSQL server and move both your vCenter and VMware Update Manager (VUM) databases (a T-SQL backup/restore alternative is sketched just after this list).
  2. Update the 64-bit DSN for vCenter and the 32-bit DSN for VUM
  3. Follow VMware KB article #1003928 to update the username and password used by vCenter to access the database
  4. Follow VMware KB article #1015223 to update the username and password used by VUM to access the database
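As noted in step 1, if you would rather not use the Import and Export wizard, a plain backup/restore is one alternative sketch. The server names, database name, and backup path below are placeholders, and the commands assume a Windows command prompt with appropriate SQL permissions; repeat the same steps for the VUM database:

    REM On the old SQL server: back up the vCenter database
    sqlcmd -S OLD-SQL -E -Q "BACKUP DATABASE [VCDB] TO DISK = N'C:\Temp\VCDB.bak' WITH INIT"
    REM Copy VCDB.bak to the new server, then restore it there
    sqlcmd -S NEW-SQL -E -Q "RESTORE DATABASE [VCDB] FROM DISK = N'C:\Temp\VCDB.bak' WITH RECOVERY"

Also remember that the 32-bit DSN for VUM in step 2 has to be edited with the 32-bit ODBC administrator (%windir%\SysWOW64\odbcad32.exe on 64-bit Windows), not the default 64-bit one.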

After all that was completed, vCenter connected and worked just fine. I had no data loss, which is what I wanted. But VUM did not work: I received ‘Failed to Load’ errors when looking at the VUM screens. A search of VMware’s KB articles and VMTN stated I needed to upgrade to VUM 4.1 Update 1, which I did. The problem persisted. All in all, not a very good way to end the day.

So I asked myself, is the VUM data all that important to me? In some ways it is, but in many ways it is not; the same information (when updates occurred) is already on each ESX host. With that realization, I reinstalled VUM and chose to use a fresh database. Voilà, VUM started working again. Apparently VUM encodes something into the database which caused this failure, but since the update data is retrievable from VMware, keeping the old database was not all that important to me.

For those who use VUM to upgrade VMs or who have lots of hosts, VUM's database may be important to you. But since I only use VUM for hosts, not VMs, and I had other means of finding the same data, it was not a huge issue. My major concern was preserving my vCenter database, which this method of migration did admirably.

So now, do I remove MSSQL 2008 from my vCenter server or leave it? I chose to leave it, as vCenter requires SQL Native Client 10.0, and there is a chance that if I remove all of MSSQL 2008 the client will also be removed, since it is not a normal part of Windows 2008 R2. I did, however, remove the old data.

Catching Spam Redux

I get tons of email and quite a bit of it is SPAM these days. To combat this I use MailScanner with Postfix, ClamAV, and SpamAssassin. I also set up special mailboxes on all email accounts specifically so that users can classify mail as SPAM or, if necessary, as HAM. Once a week or so a process runs to learn from the SPAM folders. I thought that process was working quite well. It turns out I made a simple goof that has kept the SpamAssassin Bayesian filter from being able to read my Bayesian database.

Yet the learning process worked flawlessly for both SPAM and HAM. Why HAM? Because occasionally I have to go through all the caught SPAM email and unlearn a message as SPAM for my users. That process also worked quite well, but I was constantly getting flooded with the same old SPAM messages, so I needed to dive deeper.
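For context, the weekly learning pass boils down to a few sa-learn invocations run as root from cron. This is a hedged sketch; the Maildir paths are placeholders for however your SPAM and HAM folders are actually stored:

    # Learn everything the users filed into their SPAM folders
    sa-learn --spam /home/*/Maildir/.SPAM/cur
    # Learn the HAM folders so false positives get corrected
    sa-learn --ham /home/*/Maildir/.HAM/cur
    # Unlearn a single message that was wrongly classified
    sa-learn --forget /home/user/Maildir/.SPAM/cur/some-message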

The problem is that all of the MailScanner, Postfix, and SpamAssassin code runs as the user “postfix”, while the Bayesian learning process runs as root and therefore stores the Bayesian databases owned by root. In short, I had a permission problem, and none of the tools told me this was the case.

The fix was to move my Bayesian database from the /root/.spamassassin directory to the /etc/MailScanner/bayes directory and then change the owner of those files to “postfix”. I then created a symbolic link at /root/.spamassassin pointing to /etc/MailScanner/bayes, which allowed my current Bayesian learning scripts to continue to work. With a simple change to the SpamAssassin configuration for MailScanner and a restart of MailScanner, the problem was finally solved.
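In command and configuration form, the fix amounted to roughly the following. The prefs file path is where a stock MailScanner package keeps its SpamAssassin settings and may differ on your system; adjust the user and group if your MTA runs as something other than postfix:

    # Move the Bayes databases out of root's home and hand them to postfix
    mkdir -p /etc/MailScanner/bayes
    mv /root/.spamassassin/bayes_* /etc/MailScanner/bayes/
    chown -R postfix:postfix /etc/MailScanner/bayes
    # Keep the root-run learning scripts working via a symlink
    mv /root/.spamassassin /root/.spamassassin.old
    ln -s /etc/MailScanner/bayes /root/.spamassassin
    # In /etc/MailScanner/spam.assassin.prefs.conf (or your SpamAssassin local.cf):
    #   bayes_path /etc/MailScanner/bayes/bayes
    #   bayes_file_mode 0660
    service MailScanner restart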

The problem is finally solved, and email I have marked as SPAM is finally being treated as such. Such a simple issue; I wonder why SpamAssassin was not complaining that it could not read the Bayesian databases. For something this serious, the error should have been surfaced somewhere.

vSphere Upgrade – Moving to dvNetworking Take 2? Update 2….

Since I adopted vSphere, I have been meaning to move to distributed virtual networking, but other things got in the way, such as my upgrade to a blade infrastructure as well as just general maintenance. Well, I finally gave it a try. I have 4 basic networks, each for its own trust zone. 3 of these 4 migrated quickly and easily, but the last one was proving a bit difficult, as it contained the service console of the vSphere ESX hosts as well as the administrative tools to manage the vSphere environment.

My first attempt at migrating this all-important network failed horribly. I lost connectivity to everything. Here is what I did on that attempt:

  • Used Manage Hosts to add each host to the new dvSwitch I created and the necessary portgroups.
  • Assigned the SC to one of the dvSwitch portgroups
  • Assigned NFS/iSCSI to one of the dvSwitch portgroups
  • Assigned the VMs to other portgroups

The task started but got as far as assigning the SC before the systems became inaccessible. Apparently this was not the appropriate method. My thought was that the dvSwitch Manage Hosts code would push down to each ESX host the commands necessary to make this happen without needing anything extra.

I was sadly mistaken. I effectively lost connectivity to everything used to manage the systems, and once the SC lost connectivity to my isolation address, VMware HA powered off all my VMs. What a mess. To fix it, I had to go back into the service console and run commands such as ‘esxcfg-vswif’ and ‘esxcfg-vswitch’ to migrate the service console back to the appropriate networks. What I found out, and why this happened, was that the dvSwitch portgroups were created and the SC was migrated, but the assignment of VMs to portgroups never happened. One of the VMs to be migrated was the firewall for the administrative network, so while half the migration was correct, the admin side was NOT.
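For the record, the console-side recovery looked roughly like the following. The vSwitch, portgroup, vmnic, and IP values are placeholders for my environment; substitute your own:

    # Re-create a standard vSwitch with an uplink and a Service Console portgroup
    esxcfg-vswitch -a vSwitch0
    esxcfg-vswitch -L vmnic0 vSwitch0
    esxcfg-vswitch -A "Service Console" vSwitch0
    # Remove the stranded service console interface and re-create it on the standard portgroup
    esxcfg-vswif -d vswif0
    esxcfg-vswif -a vswif0 -p "Service Console" -i 192.168.0.11 -n 255.255.255.0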

Take 2

Since I am now using HP BladeSystems, I added a second uplink from my Flex-10 interconnects to my main network switch. I then went into one ESX host and added it as a ‘CONSOLE’ network for administrative purposes. But since it was a BladeSystem, I first had to power off the host, which forced a vMotion of all the VMs off the blade. Once that was done and the new network was available, I assigned it to the service console, leaving the original vmnic to be used for the dvSwitch.

Part of this process was to migrate all the administrative VMs back to this host with the new network. While that was happening, I created the dvSwitch and all the required portgroups, then, using the Manage Hosts aspect of the dvSwitch, added the host back to the dvSwitch and assigned the now-available vmnic to it, as well as the appropriate VMs. Once that was finished for this host, the next stage was pretty simple.

That stage was to add all the other hosts to the dvSwitch and then finally move the Service Console port back to the appropriate dvSwitch portgroup, which went off flawlessly. Now that all is in order, one more vMotion of the VMs off the host with the CONSOLE network and I can remove that network.

With this completed, I am now fully using dvSwitches for everything but a few security VMs from vShield. I will have to reinstall the vShield components to get everything working appropriately.

However, I noticed all the vMotions were seriously slow. That is another story.

UPDATE: I had an issue with vShield causing all sorts of problems due to the VFILE SCSI filter not acting properly (yes, this is part of vShield Endpoint). vMotions were taking forever and stalling out, so I wanted to remove it, but DRS came into play and caused VMs to be sent all over my environment. The long and short of it was that the host holding vCenter was reset, which caused HA to fail and required a manual reboot of all the nodes. Apparently, when a VM boots, its dvSwitch port needs to be assigned (the port number is in the configuration file), but with vCenter unavailable that assignment could not happen, so the VMs could NOT boot.

The solution was to temporarily put the service console and administrative tools (such as vCenter) onto a regular vSwitch. Then, once vCenter was booted, properly bring all the VMs back up and running.

UPDATE II: It happened again. So this time I created an administrative portgroup on the VMware vSwitch I had left in place and transferred to that vSwitch: vCenter, the Service Consoles, Active Directory for management, and a few other critical bits. Then I set the boot order so that the items on this vSwitch come up FIRST within the cluster. Once vCenter is available, all else should be fine.