The tldr; is at the bottom, after the story-mode writeup. Feel free to CTRL+END or hyper-scroll your way there, or read through the machinations of my troubleshooting along the way.
In the commercial construction industry there are a number of CPQ (Configure, Price, Quote) applications. These allow sales teams to configure products to customer specifications and deliver an accurate quote (assuming pricing, BOM, etc. are up to date).
Cyncly's/Soft Tech's V6 is one of those applications, and it is critical to Manufacturing Company 2 (hereafter known as MC2), because they do aluminum extrusions for walls, doors, windows and a lot of other really cool building elements.
A number of years ago Soft Tech recommended setting up the application on a Windows server with as many RDS CALs as you needed for your users. This is to say, they recommended a monolithic server design: MS SQL, V6 application and the user space were all on the same server. This actually works fine if you have a handful of users, but what if you have 50, 100 or even more than 100 users? Shouldn't it be fine? Just add more CPU & RAM to the server, right?
This is the approach MC2 took, as it was in line with the manufacturer's spec and recommendations. MC2 even played it smart and virtualized the server in a standalone ESXi 6.7U3 installation, just in case they needed to add a lot of CPU & RAM to the server and, to be honest, bare metal is usually the wrong approach these days.
Well, performance was bad. Really bad. Like it took 45 minutes just to save some quotes, and that was after waiting 20-45 minutes while it calculated your build prior to saving. It was estimated that approximately $250,000 of payroll was spent each month with >100 people waiting for chunks of time like this multiple times a day. That's...a ridiculous amount of waste, so MC2 did, in fact, add a lot more CPU and RAM to the VM over time. It had 52 vCPUs and 512GB RAM when I entered the picture, which is an astounding amount of resources for a single app.
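For the curious, that waste estimate holds up to a quick back-of-envelope check. Here's a rough version in Python; every input below is an assumption on my part for illustration, not MC2's actual payroll data:

```python
# Rough back-of-envelope check of the ~$250,000/month figure.
# Every input here is an assumption for illustration, not MC2's actual data.
users = 100            # ">100 people" stuck waiting on V6
waits_per_day = 2      # "multiple times a day" -- call it two long waits
minutes_per_wait = 60  # 20-45 min calculating plus ~45 min saving, averaged down
hourly_cost = 55       # assumed fully loaded payroll cost per hour
workdays_per_month = 21

hours_lost = users * waits_per_day * (minutes_per_wait / 60) * workdays_per_month
print(f"Hours lost per month: {hours_lost:,.0f}")
print(f"Payroll burned per month: ${hours_lost * hourly_cost:,.0f}")
# ~4,200 hours and ~$231,000/month with these inputs -- same ballpark.
```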
A few months after I came onboard, this issue was laid in front of me. Why not sooner? Who knows? I had multiple meetings with various business managers and leadership before that, and no one brought it up, but once I heard about it, I determined that the waste would stop with me, because it made no sense to allow it to continue.
Seeing as I had the V6 and IT Operations teams under my umbrella, I set out to gather some history, context, and examples, and to get some vendor calls going so that we could claw back some sort of performance gain. Soft Tech personnel encouraged us to rebuild the server according to their current standard, which is to separate the MS SQL server from the RDS/application server. This sounded like it might have some benefit, so we proceeded to build two new servers in the datacenter that MC2's owner had for their subsidiaries.
Once built, we arranged for some testing with 10 users along with the V6 team and...it was slower than production. This was an unexpected outcome, as we had actually been confident we would get a performance lift, especially considering that only a fraction of the normal user load was contributing to the tests. But the worst part was that, now that the SQL and app servers were separated, the slowness was clearly all on the app server, and there was barely a trickle of data going to the SQL server.
We hopped back on a call with Soft Tech and put our heads together, but most of their documentation pointed to DB tuning & maintenance, and the DB hadn't even come into play in the performance we were seeing.
A backup plan I'd formulated was to see if we could install the V6 application on some of the users' actual computers instead of relying on a server, as many of them had pretty decent CAD workstations. We pulled one out of standby inventory, set it up, and sure enough, it was much faster than production, even though it had far fewer cores. Out of curiosity, I checked the CPUs on the cluster in the data center: even though they were the same generation as those in the production server, they were Xeon Silvers, which are mid-tier and not as "fast" as the Xeon Golds we had been using in production. The workstation? Much higher clock speed.
My brain was trying to put it together: fewer, higher-clocked cores were faster on the workstation, and even with fewer users it was slower on the Xeon Silvers in the data center...it was the CPUs. Not the number of them, but the frequency, the gigahertz...raw speed was the difference. How could I confirm this? How could I follow this thread through with actual performance data?
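To make the hypothesis concrete, here's a minimal stand-in in Python: a strictly serial, CPU-bound loop whose wall time depends only on how fast a single core is, no matter how many cores the box has. The "quote calculation" below is purely illustrative, not anything from V6:

```python
import os
import time

def fake_quote_calculation(iterations: int = 20_000_000) -> float:
    """A strictly serial, CPU-bound loop: each step depends on the previous
    one, so extra cores cannot make it any faster."""
    value = 0.0
    for i in range(iterations):
        value = (value + i) * 0.9999997  # serial dependency chain
    return value

if __name__ == "__main__":
    start = time.perf_counter()
    fake_quote_calculation()
    elapsed = time.perf_counter() - start
    print(f"Cores on this box: {os.cpu_count()}")
    print(f"Wall time: {elapsed:.1f}s")
    # Run it on a 2.3GHz server core and then on a 3.8GHz workstation core:
    # the only thing that moves the wall time is per-core speed.
```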
Well, thanks to another project in which I had to acquire a 3rd server and was in the middle of converting the production ESXi servers into a vSAN cluster, I had a spare server I could vMotion things off of and "upgrade" with CPUs from a different Xeon Gold model. And I did just that: I took the 4 Xeon Gold 18-core 2.3GHz CPUs out and installed 4 Xeon Gold 8-core 3.8GHz CPUs. Once that was done, I arranged some downtime, vMotioned the V6 server (all 4 TB of it) to the "upgraded" host, and we tested production again. We saw an immediate 20% performance uplift during regular production hours.
20% of (45 minutes * 2) is about 18 minutes, so an hour and a half was reduced to...almost an hour and a half. It was something, but not enough to actually call this a success, so I needed to understand WHY this change gave us some performance, as most virtualization environments for years have relied on more cores at middling clock speeds, and CPUs with more cores meant servers with fewer CPUs, fewer sockets, smaller chassis, etc. It was more efficient...unless you're talking about V6. We could never actually get the production server to use all of the vCPU assigned to it. It would kind of just...plateau at about 40-50% usage.
The V6 team and I put our data together and pushed back on Soft Tech. We needed to understand why, and the pieces all fell into place when they mentioned that they also recommend that customers with more than 20 or so users set up a load-balanced RDS cluster.
Something clicked. Fewer cores at higher frequency, the vCPU usage plateau, load-balanced RDS after 20 users...this was a single-threaded app. That fact alone would explain everything we'd discovered in testing. And that was the answer. In fact, on the V6 roadmap was a "performance enhancement" that would move the saving process to a new thread, which should alleviate some of the save slowness, but that doesn't make it a multi-threaded app, and about half of the performance problem happened before saving anyway.
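Single-threading also explains the vCPU plateau: one session can never occupy more than one vCPU, so the host tops out at roughly (sessions actively calculating) / (vCPUs). Here's a toy model; the busy-session count is an assumption for illustration, not a measured figure:

```python
# Toy model of the 40-50% vCPU plateau. A single-threaded session can never
# occupy more than one vCPU, so the host tops out at roughly
# (sessions actively calculating) / (vCPUs). The session count is assumed.
vcpus = 52
busy_sessions = 24  # sessions mid-calculation at any given instant (assumed)

per_session_ceiling = 1 / vcpus
host_ceiling = min(busy_sessions, vcpus) / vcpus
print(f"One session can use at most {per_session_ceiling:.1%} of the host")
print(f"{busy_sessions} busy sessions cap the host at about {host_ceiling:.0%}")
# ~1.9% per session, ~46% overall -- right in the plateau we were seeing.
```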
The solution was clear: swap all of the CPUs in the (soon-to-be) vSAN cluster to the 8-core Xeon Golds, set up some RDS servers in a cluster, and test again. This time, the results far exceeded anything we'd envisioned: a 200-600% performance uplift. And that was with 20 users on the 2 RDS servers with 16 vCPUs and 64GB RAM, accessing the test DB (which was for some reason ALSO on the production server), with quotes that were designed to be monstrously huge, AND during business hours. Barely a hiccup was noticed on the production server, because we'd already come to understand that the workload was app-heavy, not DB-heavy.
We took 2 weeks to do some additional tweaking on the RDS servers, install printers, make sure the policies were in place, etc., and then started scheduling teams to move over. "Night and day" was the feedback we received, and the beauty of this setup is that at any time we could add additional servers by cloning one of the RDS servers, even during business hours, so we could scale the solution to the number of users pretty easily.
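If you're sizing something similar, the rough math looks like the sketch below; the user count, peak concurrency factor, and per-host size are my assumptions for illustration, not Soft Tech guidance:

```python
import math

# Rough sizing sketch for the session host tier. All inputs are assumptions
# for illustration, not vendor recommendations.
users = 110               # ">100 users"
vcpus_per_host = 16       # assumed session host size, similar to our test boxes
peak_concurrency = 0.6    # assumed fraction of users calculating at peak

peak_busy_sessions = math.ceil(users * peak_concurrency)
hosts_needed = math.ceil(peak_busy_sessions / vcpus_per_host)
print(f"Peak busy sessions: {peak_busy_sessions}")
print(f"Session hosts needed at one busy session per vCPU: {hosts_needed}")
# When that number creeps up, clone another session host into the collection.
```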
tldr; performance was abysmal, so we worked with the vendor to rearchitect the server. Splitting up the server and moving it to the data center performed even worse, so we tested on a decent workstation, which was significantly faster. The vendor admitted the app was single-threaded, so we "downgraded" our server CPUs to 8-core models with higher clock speeds, then set up a load-balanced RDS cluster and saw a massive performance uplift. MC2's users are now expected to put out 200-600% more work (just kidding).
Total cost: ~$6,000 (for the CPUs) + some goodwill from MC2's owner when I started asking about why their servers were slower.