The tl;dr is at the bottom, after the story-mode writeup. Feel free to CTRL+END or hyper-scroll your way there, or read the machinations of my troubleshooting along the way.
VMware makes ESXi, which can be used in stand-alone mode for servers or managed with vSphere, which lets the hypervisors be managed together: you can move VMs around, set up resiliency/HA and automate a bunch of things. It's pretty common for small businesses to use ESXi in stand-alone mode, which you can get a free license for, whereas larger organizations pay for vSphere licensing so that they can orchestrate their infrastructure.
On a Thursday, shortly after I came onboard at Manufacturing Company 2 (called MC2 from now on), one of their 2 stand-alone ESXi servers failed. We knew this because the Tier1 team alerted us to a LOT of tickets related to DNS, logging in, etc. We did not find out from any monitoring tools, because MC2 never used any for servers (only for their cloud-managed network equipment and firewalls). Even if they had, the tool likely would have been set up as a VM on the server that failed anyway, as there were not many other places to install something like that.
Luckily, infrastructure is one of my core skills, and servers, storage and virtualization (especially VMware) are my brightest core skills, so I immediately tried to access the ESXi interface in my browser, but couldn't get to it. Uh oh, no bueno, so I ran to the MDF (which was really just a closet), plugged in a monitor, mouse and keyboard (because there was no KVM) and saw...nothing. No lights on the front bezel, no blinky-blinks from the drives, no fans spinning in the PSUs around back.
Hardware failure. That's bad.
Were there backups? Yes, but the backup server was a VM on that hypervisor. So were both domain controllers, for some reason (hence the DNS issues). So were all of the VMs except for our V6 server, which was the only VM on the other hypervisor.
Was there a support contract? A call to Dell answered that: no. It had expired a few months before I joined the firm, and as I found out later, no one seemed to be keeping track of those anyway.
So, plan of attack: we needed to fix DNS, as that would resolve 90% of the immediate user tickets. Luckily, an extremely knowledgeable fellow at the company that owns MC2 had another domain controller for MC2 in their data center. He happily volunteered to change the DNS servers in the DHCP scopes at all of the branches to their corresponding firewalls, the Tier1 and Tier2 teams started changing the DNS servers on all of the printers and other static-IP devices to the firewalls, and the upstream DNS settings in all of the firewalls pointed to the lone available domain controller in the sky/cloud/data center.
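With that many scopes, printers and firewalls being touched by different hands, it's easy to miss one, so a quick resolution check against the new path is worth scripting. Below is a minimal sketch using the third-party dnspython library; the hostnames and firewall IPs are placeholders for illustration, not MC2's actual values.

```python
# Spot-check that key names resolve through the firewalls' DNS forwarding.
# Requires the third-party dnspython package (pip install dnspython).
# All names and IPs below are placeholders.
import dns.resolver

FIREWALL_DNS = ["10.1.0.1", "10.2.0.1"]                        # branch firewalls now handed out via DHCP
MUST_RESOLVE = ["dc01.corp.example", "fileserver01.corp.example"]

for server in FIREWALL_DNS:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    resolver.lifetime = 3  # seconds before giving up on this server
    for name in MUST_RESOLVE:
        try:
            answer = resolver.resolve(name, "A")
            print(f"{server}: {name} -> {[r.to_text() for r in answer]}")
        except Exception as exc:
            print(f"{server}: {name} FAILED ({exc})")
```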
This got us up and limping, which is a good first step, but I needed to get the server fixed. I pulled out my flashy corporate credit card and immediately placed an order for a support contract on the failed server, then escalated it to get it applied right away. Well, there wasn't really a way to get it applied right away: it takes a few days, and even when the support peeps at Dell try to escalate it internally, it still takes a few days. Not including the weekend, of course.
We had some wiggle room on the other hypervisor, though not a lot with 14TB of space already in use, so I instructed Tier3 to grab a spare desktop, install Veeam and get some backups restored. By Friday EOD we'd restored our primary domain controller and were starting to restore 1 of the 2 file servers (each about 1.4TB in size). We had some breathing room while the restores worked over the weekend, as each file server needed about 16 hours to finish. We'd also determined that the downed hypervisor was not malfunctioning due to any add-in cards, specific CPUs, RAM or PSUs. We'd pulled everything out and still were not able to get any power-on activity.
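For a sense of why each restore took so long: 1.4TB over roughly 16 hours works out to around 24 MB/s of effective restore throughput. A back-of-envelope sketch, using only the figures above:

```python
# Back-of-envelope restore throughput from the numbers in this post:
# a ~1.4TB file server restored in roughly 16 hours.
tb_restored = 1.4
hours = 16

bytes_restored = tb_restored * 1000**4      # TB -> bytes (decimal units)
seconds = hours * 3600
throughput_mb_s = bytes_restored / seconds / 1000**2

print(f"~{throughput_mb_s:.0f} MB/s effective restore throughput")
# ~24 MB/s -- a reminder that restore-time math belongs in any DR plan.
```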
On Monday we were still limping, and in a pretty precarious position. We were experiencing reduced services but were mostly functional; the hypervisor was still not working and Dell still had not finished processing the support contract for the server. I needed another server I could slot the RAID controller and SSDs into, as I suspected the issue was with the motherboard (based on the earlier troubleshooting) and I was hoping that the controller and disks were okay. If so, we should be able to fully restore services pretty easily.
I went shopping for the same model server, and I found a refurb on Amazon for a pretty economical price (the model was 5 years old), so I used the trusty corp CC again and ordered one with similar specs and expedited shipping. It arrived 2 days later, on Wednesday, so we slotted in the controller and disks, crossed our fingers and powered it on.
Success! It booted perfectly and all VMs were accessible. I adjusted a few settings in the BIOS and ESXi and rebooted again, then fired up some of the VMs. I had Tier3 reconcile the delta of changes on the file servers, and over the next few days we were back to normal operation.
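When a datastore full of VMs comes up on different hardware, it's worth quickly confirming that every VM registered and noting its power state before flipping anything on. Here's a minimal sketch against a stand-alone host using pyVmomi (VMware's Python bindings for the vSphere/ESXi API); the host name and credentials are placeholders, and certificate checking is disabled only because a stand-alone host typically has a self-signed cert.

```python
# List every VM registered on the replacement host along with its power state.
# Requires the pyVmomi package (pip install pyvmomi). The host and credentials
# below are placeholders, not the real environment.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # stand-alone ESXi usually has a self-signed cert

si = SmartConnect(host="esxi-replacement.example.local",
                  user="root", pwd="********", sslContext=ctx)
try:
    content = si.RetrieveContent()
    # A container view recursively collects all VirtualMachine objects on the host.
    vm_view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in vm_view.view:
        print(f"{vm.name:40s} {vm.summary.runtime.powerState}")
finally:
    Disconnect(si)
```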
Also on Wednesday the support contract for the failed server kicked in and we scheduled the first available technician to come out and fix it. They shipped a new motherboard and a bunch of other parts to us so that they could perform the repairs, though due to some stock, shipping and technician scheduling issues it would be almost a week before we had everything and everyone there.
It took close to 45 days to get the server fixed. This model had a rare but specific issue with the front USB port and an associated controller chip which, when it failed, would prevent the system from working at all. It also was a required component for some reason, so leaving it disconnected was not an option. I cannot even imagine how nerve-wracking it would have been to limp along for almost 2 months before having the server operational again, so I'm confident that purchasing the 3rd server was the right decision. Now we had 3 servers, all in stand-alone ESXi, with plenty of hardware resources (4x4 Xeon Golds with 18/36, 1TB RAM & ~30TB SAS12 SSDs each). I made sure we had support and iDRAC on each, then moved on to the next phase, which is here.
tl;dr: critical hypervisor failed with no support contract, I ordered a refurbished server of the same type, we slotted in the RAID controller and disks and voila! Back up and running. The failed server took about 45 days to fix completely.
Total cost: ~$4000 + about 3 years off all of our lives