Making IT Right – the Servoyant Way
By: Enrique Krajmalnik
The other night I sat down to watch “Holmes on Homes”. In my opinion it is one of the better home improvement shows out there, its premise being that Mike Holmes is going to come in and “make it right” after previous contractors failed to do the job properly. After walking through the house with the disgruntled homeowner, Mike and his crew literally start tearing the house apart. They remove ceilings, walls, flooring, etc… whatever they need to remove in order to really see and fix the problems. During his discovery process he almost always finds other problems that were not apparent to the homeowner, and in some cases these problems are actually worse than what he was brought in to fix. When they are done, the problems are fixed, the house is finished properly, and everything is neat, clean and tidy.
His approach seems over the top at first, but there is a lot we can learn in IT from Holmes on Homes. To begin with, Mike listens to the homeowner, collecting important data about the symptoms of the problem. He then devises a strategy to get under the hood and find the root cause. He does not believe in patching a problem. He fixes it, so that the homeowner can be problem-free for years to come. Too often in IT we merely work around the issue, and eventually the workaround breaks, or the problem manifests itself in some other way.
So how do we, as IT professionals (warriors), get under the hood in much the same way as Mike and his crew to see what is going on? Fortunately, we can do it without destroying walls or having to remove asbestos from old pipes. Installing a comprehensive monitoring system with deep inspection and visibility into systems and components can give you the same type of information Mike and his crew derive from checking the condition of framing, mechanical systems, and other aspects of a home. With a good monitoring system you will have visibility, in almost real time, into the health, performance and behavior of systems and applications.
Here is a great example of a real situation that one of our fellow Voyants (MSPs using Servoyant to provide service to their customers) recently shared with us. The client’s network is diverse and spread across 20 locations throughout the United States. There are approximately 120 total users, and they are heavily dependent on their ERP package, which is delivered via 4 load-balanced terminal servers to end users across all locations. The core infrastructure is located at a data center and consists of HP servers running in a Hyper-V cluster along with dedicated database servers. Everything was running great until the latest ERP upgrade.
Although everything ran fine during the pilot of the new ERP package, as soon as the new version was rolled out to the entire company, the problems began to manifest themselves. Entering orders would take 30-45 seconds just to change screens. Querying inventory could take 3-4 minutes per item being queried. This became the hottest topic of conversation among the employees, and neither they nor management was happy about what was going on. Instead of turning to the ERP people, management turned to their IT provider, our fellow Voyant. It was a good thing they did.
Traditionally, the solution to these problems would be to throw more hardware at them. If something is slow, then it must need more memory, faster processors, and so on. This was the conventional wisdom of the past. When you are dealing with live production systems, a forklift upgrade may not be an option. Plus, how do you explain to the client that they need to upgrade all of the equipment they purchased just six months ago? No matter how you slice it, we can no longer make blind recommendations in today’s demanding IT environments. The good news is that with the right tools in place, you do not need to guess. You can find the root cause and make the right recommendations, just like Mike Holmes.
To wrap up the story, our fellow Voyant took a pragmatic approach to the problem reported with the ERP package. They began by reviewing the telemetry collected through Servoyant, focusing on the terminal servers as well as the database backend. The tier 3 engineers looking into this immediately spotted 3 potential root causes, the most likely being that the backend server was I/O bound. After adding several tests to the servers to rule out the other possibilities, the engineers continued collecting data via Servoyant, increasing the frequency of checks. At the same time, the managers at each location were instructed to contact the engineers whenever the systems were behaving particularly slowly. The engineers were then able to match the times of those calls to periods of high disk I/O on the backend, confirming their hypothesis.
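For readers who want to try the same kind of matching on their own data, here is a minimal sketch in Python. It is not how Servoyant or the engineers in this story did it. The function name, the threshold, the matching window and all of the sample numbers are assumptions made up for illustration. The idea is simply to take the times users complained, take the disk I/O samples exported from whatever monitoring tool you use, and count how many complaints landed close to a high reading.

```python
from datetime import datetime, timedelta


def fraction_of_calls_during_high_io(call_times, io_samples, threshold, window):
    """Return the share of slowness calls that fall near a high disk I/O sample.

    call_times: datetimes at which a location manager reported slowness.
    io_samples: (datetime, value) pairs exported from a monitoring tool.
    threshold:  value at or above which a sample counts as "high" (assumed).
    window:     how far apart a call and a sample may be and still match.
    """
    high_io_times = [when for when, value in io_samples if value >= threshold]
    if not call_times:
        return 0.0
    matched = sum(
        1
        for call in call_times
        if any(abs(call - when) <= window for when in high_io_times)
    )
    return matched / len(call_times)


# Made-up data: one disk queue length sample per minute, starting at 9:00.
start = datetime(2020, 1, 6, 9, 0)
queue_lengths = [1, 1, 9, 1, 1, 1, 8, 1]
io_samples = [(start + timedelta(minutes=i), q) for i, q in enumerate(queue_lengths)]

# Made-up call log: two managers phoned in to report slowness.
call_times = [
    start + timedelta(minutes=2, seconds=30),
    start + timedelta(minutes=4, seconds=30),
]

share = fraction_of_calls_during_high_io(
    call_times, io_samples, threshold=5, window=timedelta(minutes=1)
)
print(f"{share:.0%} of calls were within one minute of a high I/O sample")
```

A high share supports the I/O hypothesis, while a low share suggests looking at the other candidate causes first. With real data you would want many more calls than this before drawing a conclusion.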
The SQL database resided on a storage array, with its data housed in a shared RAID 5 container. To resolve the issue, the storage array was repartitioned so that the backend database was allocated 6 drives in a dedicated RAID 1+0 container. As a result of this change, disk I/O was significantly improved, and the problem was resolved. They were able to provide the client charts of I/O both before and after the change, and the results were dramatic and obvious. In the end, the customer was pleased, and nobody had to tear down any walls or ceilings.
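To see why the RAID level matters, it helps to run the usual back-of-the-envelope numbers. A small write on RAID 5 typically costs four disk operations (read data, read parity, write data, write parity), while the same write on RAID 1+0 costs two (one per mirror). The Python sketch below applies that rule of thumb. The per-drive IOPS figure, the 50/50 read/write mix and the function name are assumptions chosen for illustration, not measurements from this client. The story also does not say how many drives were in the original RAID 5 container, so the sketch simply compares the two levels on the same 6 drives.

```python
# Rule-of-thumb back-end disk operations per host write, by RAID level.
RAID_WRITE_PENALTY = {"raid5": 4, "raid10": 2}


def effective_iops(drives, iops_per_drive, write_fraction, level):
    """Estimate the host-visible IOPS an array can sustain.

    Each read costs one disk operation and each write costs `penalty`
    disk operations, so the raw total is divided by the blended cost.
    """
    raw = drives * iops_per_drive
    penalty = RAID_WRITE_PENALTY[level]
    return raw / ((1 - write_fraction) + write_fraction * penalty)


# Assumed values: 6 drives, 150 IOPS each, half of all operations are writes.
for level in ("raid5", "raid10"):
    estimate = effective_iops(6, 150, 0.5, level)
    print(f"{level}: about {estimate:.0f} IOPS")
```

Under these assumed numbers the estimate works out to roughly 360 IOPS for RAID 5 and 600 for RAID 1+0. The sketch ignores controller cache and, more importantly, the fact that the original container was shared with other workloads, so moving to a dedicated container likely helped beyond what this arithmetic shows.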
Getting under the hood and seeing what is going on behind the scenes is the only way to do things right. As professionals, we need to take a cue from Mike Holmes, and make recommendations based on facts and data in conjunction with our experience and expertise. Having the right tools is necessary for providing the level of support customers of all sizes have come to expect in today’s business environment.