There is a growing push for people to learn less about the systems on which they run their applications. It started with converged infrastructure, moved into hyperconverged infrastructure, and now I see it continuing with Docker and other container technologies. This puzzles me. While it makes the developer’s life easier, does it really make anyone else’s life easier? Do we really need to consider the stack anymore?
In highly distributed systems, there are still several key areas that need consideration, knowledge, and aptitude: network, storage, security, application APIs, and code. I see most people concentrating on application APIs and code while leaving the other areas to someone else or ignoring them altogether under an “it just works” mentality. What has happened to the systems engineer of the past? And more importantly, what happens when the system goes wrong and you have to debug it?
DevOps has an answer to that: kill it and start over. Well, that is great, but my cattle are in herds; what impacts one cow could impact the herd. So I need to call in a veterinarian to diagnose my cow’s illness and determine whether the rest of the herd has the same problem. I have written about this mentality in the past. I prefer to herd my workloads around rather than kill them and start over. Why? Mainly because I want to learn more, solve the problem, and then move on.
I think the “kill and start over” mentality sprang from Microsoft’s method of debugging Windows: “Just reboot; it will fix everything.” But that is just not true. Rebooting does not fix anything. It masks the symptoms, but the underlying problem is still there.
For example, a group once told me that its systems would enter read-only mode on a regular basis, but that rebooting the hypervisor would fix everything for a set amount of time. Eventually, that failure spread to the entire cluster, and reboots became more and more painful. The ultimate root cause of the problem was failing, but not yet failed, hardware. That is hard to detect unless you know exactly how the system works, what to look for, and how to read the myriad logs within the stack, from the hardware up to the hypervisor, to the guest operating system, and eventually to the application log for every node in the cluster. It was, after all, a cluster issue.
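To illustrate the kind of cross-layer digging that diagnosis required, here is a minimal sketch that scans logs from several layers for signatures of a filesystem going read-only or of hardware that is failing but not yet failed. The log paths, the application log name, and the error patterns are assumptions for illustration only; real locations and messages vary by hypervisor, guest operating system, and application.

```python
#!/usr/bin/env python3
"""Illustrative sketch: scan logs from several layers of the stack for
signatures of a filesystem going read-only or hardware starting to fail.
Paths and patterns are assumptions for illustration only."""

import re
from pathlib import Path

# Hypothetical log locations per layer; real paths depend on the
# hypervisor, guest OS, and application in use.
LOG_SOURCES = {
    "guest-os": "/var/log/syslog",
    "hypervisor": "/var/log/vmkernel.log",
    "application": "/var/log/myapp/app.log",  # hypothetical application log
}

# Signatures that often accompany "failing but not failed" hardware or a
# filesystem being remounted read-only (illustrative, not exhaustive).
PATTERNS = [
    re.compile(r"remount(ing|ed)? .*read-only", re.IGNORECASE),
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"SCSI .*(abort|reset)", re.IGNORECASE),
    re.compile(r"medium error|uncorrectable", re.IGNORECASE),
]


def scan(layer: str, path: str) -> list:
    """Return matching log lines from one layer, tagged with the layer name."""
    hits = []
    log = Path(path)
    if not log.exists():
        return hits
    with log.open(errors="replace") as fh:
        for line in fh:
            if any(p.search(line) for p in PATTERNS):
                hits.append(f"[{layer}] {line.rstrip()}")
    return hits


if __name__ == "__main__":
    for layer, path in LOG_SOURCES.items():
        for hit in scan(layer, path):
            print(hit)
```

Even a crude pass like this only tells you which layers are worth reading closely; correlating timestamps across hardware, hypervisor, guest, and application logs for every node in the cluster is still where the real root cause analysis happens.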
That one problem makes me wonder whether hyperconverged architectures will hit the same kind of issue, with the only apparent solutions being to reboot the environment or to swap out hardware without first finding the root cause. I just do not have a warm feeling that those building and selling EVO:RAIL and other HCI environments have the skills to do deep root cause analysis. At the moment, most tools that claim to do root cause analysis are not very deep, produce too many false positives, or require you to read the myriad logs yourself to find the real answer. vRealize Operations, with its vRealize Log Insight integration, requires just that. Ultimately, it is the logs to which you must refer.
When looking at Docker/containers and HCI, one question to answer is how to determine the true root cause of any problem. For HCI, this implies that support departments should include senior-level, knowledgeable people who can work off-script, do the research, and provide an answer quickly and with much thought and consideration. I do not see many container or HCI support groups that include senior-level people.
So, who debugs your stack? How do you find the root cause of any problem? This is, after all, still required, and without finger-pointing. Furthermore, do the vendors who provide HCI have the knowledge and aptitude to provide this level of support?

2 replies on “Who Will Debug?”

  1. Actually I don’t see this as a major problem, though it’s food for thought.
    Most of the converged/hyperconverged systems present an abstracted layer on top of the existing ‘stack’, but you can still get all the low-level info if you want – the individual components are still there, just pre-integrated. The vision is that this reduces day-to-day complexity, but I agree the complexity is still there under the hood. As these solutions become more popular and include more functionality, buyers may adopt more technology than before (maybe not consciously), but as long as the solution does what it claims (reducing operational overhead/cost), then that’s progress. We *should* be worrying less about the details and moving ‘up the stack’ while still having the ability to deep dive if required. I remember being able to dive into DOS memory allocation (remember HIMEM.SYS?), but over time that became less relevant – same concept.
    I also wouldn’t single out converged/hyperconverged vendors’ support as being any weaker. Some vendors (Nutanix certainly comes to mind, as does VCE) have plenty of very smart and knowledgeable engineers, although some startups may offer a different experience – smaller staff numbers but more attention, maybe?

    1. The reason no one looks at himem.sys anymore is that DOS is no longer in use, yet the underlying hardware is. Stacks are getting more complex, not less so. Hiding that complexity may make operations easier, but when there is a problem, many companies will not have the folks on staff to handle it and will have to rely on the support staffs of the hypervisor, hardware, and storage vendors. Those vendors will point to the HCI vendor, such as VCE or Nutanix. While the HCI vendors have very smart engineers, those engineers are not the ones contacted first when there is a support question, and it may take quite a while before you get hold of one of them. It really depends on how the purchasing companies approach their problems. I am not sure they will call the HCI/converged vendor first or even second.
