Why is Network Troubleshooting So Hard?
In one of my previous articles, I wrote about what to look for in an NPM solution. On the surface, I’ll continue on that line of thinking, but we’ll be taking a different look at what network tools can and cannot do.
Despite what vendors may tell you, these tools by themselves cannot diagnose your network problems and fix them. Although numerous technological advances have come forth in recent years, it is unreasonable to expect these tools to be a perfect solution to the myriad of challenges IT teams face daily.
Within this article, we will discuss what you should expect from them and when good old-fashioned network troubleshooting expertise, experience, and creative problem-solving skills are still required. To make this point, we will use analogies from other domains and touch on the psychology behind why some people expect the software to do everything for them.
To put this into perspective, I am a programmer and have been developing network troubleshooting software for over 25 years, primarily working on Omnipeek and now LiveWire. I didn’t create either of these products. However, they’re both architected in such a way, with APIs all over the place, that I could make a career out of extending them in many ways. During that time, I became very familiar with packets and the analysis and visualizations that can be done.
Fixing Your Network, One Phone Call at a Time
Over the last couple of years, I have transitioned into an SME or Subject Matter Expert, working with the sales team and support engineers. I am on the phone all day long talking to customers. And what do customers want? They want a network monitoring and troubleshooting solution to tell them precisely what is wrong with their network and how to fix it. The closer we can get to this, the more likely a prospective customer will buy our products.
However, it’s rarely the case that a group of us smart people can get on the phone, with the tools in place, and solve a potential customer’s network problem just by looking at the packets on the network. The tools can provide all kinds of clues and point you in the right direction, but rarely can they tell you definitively what the problem is, because the problem is not the data or the wire itself. The problem is caused by something that the wire is connected to. The problem is the producer or consumer of the data, or some device in between. This could be a phone, a laptop, a web server, an application server, a router, a switch, a firewall, an authentication server, a DNS server, or something in the cloud.
Time to Repair the Network
Herein lies the other problem: whatever device is causing the issue cannot be placed in a vacuum for observation either. You can’t just monitor a switch and know that there is network latency, or look at an app server and know there is application latency, or look at an FTP server and know that a file transfer is taking too long.
It is not until the packets get onto the network, and take too long to reach their destination, that these problems can be identified, measured, analyzed, and visualized. So you see, there is a symbiotic relationship between monitoring and troubleshooting the network traffic itself and monitoring and troubleshooting the devices on the network that are generating and receiving the traffic.
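To make that a little more concrete, here is a minimal Python sketch of how delay might be measured once packets are on the wire. It assumes a hypothetical, simplified packet record (a timestamp plus a label such as "SYN" or "REQUEST"); the field names and the way the delay is split into network and application latency are illustrative assumptions, not the output or method of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ts: float   # capture timestamp in seconds
    kind: str   # hypothetical label: "SYN", "SYN-ACK", "REQUEST", "RESPONSE"


def first_ts(packets, kind):
    """Return the timestamp of the first packet of the given kind, or None."""
    return next((p.ts for p in packets if p.kind == kind), None)


def latencies(packets):
    """Estimate (network, application) latency in seconds for one flow.

    Network latency: time from the client SYN to the server SYN-ACK.
    Application latency: time from the request to the first response
    packet, minus the network round trip estimated above.
    Returns None if any of the four packets is missing.
    """
    syn, syn_ack = first_ts(packets, "SYN"), first_ts(packets, "SYN-ACK")
    req, resp = first_ts(packets, "REQUEST"), first_ts(packets, "RESPONSE")
    if None in (syn, syn_ack, req, resp):
        return None
    network = syn_ack - syn
    application = max((resp - req) - network, 0.0)
    return network, application


# A made-up flow: quick handshake, slow answer from the server.
flow = [
    Packet(0.000, "SYN"),
    Packet(0.020, "SYN-ACK"),
    Packet(0.041, "REQUEST"),
    Packet(0.561, "RESPONSE"),
]
network, application = latencies(flow)
print(f"network ~{network * 1000:.0f} ms, application ~{application * 1000:.0f} ms")
```

In this made-up flow the handshake is quick and the response is slow, which hints at the server rather than the path. It is still only a hint: the numbers say where the time went, not why.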
And now we get to the crux of the problem. Today’s networking tools do not tell you precisely what is wrong with your network. And even if the tool gives you lots of ideas about what is wrong, it will not magically fix it or even tell you how to do that.
Getting Under the Hood
To demonstrate why this is reasonable, I will use my favorite analogy: cars. Let’s say your car is slow. It starts and runs, but it just won’t go very fast, or at least as fast as it used to. What do you do to fix it? Most likely, you are not an auto mechanic yourself and don’t have the necessary knowledge and tools, so you take it into the shop, and you ask them to fix it.
Now let’s jump into the mind of the person that is going to work on it. The first thing this person is going to do is ask questions about the problem, like the symptoms, when it happens, details about the car, etc. As an experienced auto mechanic, you may already have some ideas about what the problem might be. You might also be thinking about different ways to fix it. So already, you are using your mind and deductive reasoning to analyze the situation and find a solution. But experienced or not, the first thing you are going to do is run some tests, most likely to reproduce the problem. You have lots of tools for this, which help reproduce the problem and provide all kinds of analysis.
When the Network is Not a Car
Let’s pause here since this is not an article about fixing a car. The point is to draw a comparison between fixing a slow car and fixing a slow network. In both cases, let’s point out that we are talking about a performance problem. If the car were broken, or similarly if the network were to break, it might be easier to fix (like how one may simply replace a part like an alternator or a router). But those are not the types of problems we need good tools for.
We have to test and analyze the soft problems like performance to figure out why the unbroken system is not performing at peak capacity. And this is where the tools come in, so back to the car. Disclaimer: I am not an expert auto mechanic, but I do own an OBD scanner, and when I ask my mechanic what the problem is, they will tell me which codes or events the OBD scanner gave them.
Bringing the Network to the Shop
Hard stop! This is so similar to troubleshooting a network. When you report a network performance problem to IT, they are going to use a tool, like Omnipeek, to capture the packets and analyze the situation. Tools like these will give you expert events, which are analogous to the OBD codes.
In either case, the information provided by the tool describes symptoms of the problem, not the problem itself or the solution. In fact, the OBD codes I get when my car is running poorly usually refer to some sort of emissions problem, which more often than not ends up being a bad sensor, not an emissions problem at all.
As an aside, an experienced mechanic who uses the tools to help make a decision will know this and replace the sensor. An inexperienced (or dishonest) mechanic will blindly believe the tool and substitute something else, like a catalytic converter.
Tools of the Trade
Like mechanics, IT people have tools to diagnose network problems. Some present lists, and others present graphs of statistics over time. In the case of advanced IT tools, there may be information about latency, jitter, MOS scores, etc. But in either case, it is up to the expert to use this information as clues to solve the problem and determine conclusively what is wrong.
And that is the main point I am trying to make here, because all too often I get on the phone with somebody who is interested in our solution, and they say, “I am looking for a tool that will tell me what is wrong with my network.” Or, more specifically, they will say, “This operation is slow, and I need the product to tell me why.” And of course, as a protocol analyst, I will start asking questions, looking at the slow flows and the details about them. But very rarely is the answer obvious, and even if the software can figure it out, it never says exactly what the problem is. And why is that?
Cause and Effect Troubleshooting
The highest level of network performance-related expert events I am aware of are around calls, flows, and TCP, like latency, jitter, retransmissions, sequence numbers, window sizes, etc. These are all good clues, but again they are the what, not the why. In Omnipeek, each of the over 200 experts has a description, cause, and remedy that is easy to refer to. This helps a bit, but again, only as a pointer in the right direction. What also helps is how well they are integrated into the rest of the solution.
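As a rough sketch of what an expert event with a description, cause, and remedy might look like, here is a small Python example that flags repeated TCP data segments. This is not how Omnipeek implements its experts; the `ExpertEvent` and `Segment` types, the flow identifier format, and the wording of the cause and remedy are all hypothetical, and the check is a deliberately simplified heuristic that a real analyzer would need to extend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertEvent:
    name: str
    description: str  # what was observed
    cause: str        # possible causes, not a diagnosis
    remedy: str       # where to look next


@dataclass(frozen=True)
class Segment:
    flow: str    # hypothetical flow identifier, e.g. "A->B"
    seq: int     # TCP sequence number
    length: int  # payload length in bytes


def find_retransmissions(segments):
    """Flag data segments whose (flow, seq, length) was already seen."""
    seen = set()
    events = []
    for s in segments:
        key = (s.flow, s.seq, s.length)
        # Ignore zero-length segments such as pure ACKs.
        if s.length > 0 and key in seen:
            events.append(ExpertEvent(
                name="TCP Retransmission",
                description=f"Segment seq={s.seq} on {s.flow} was sent more than once.",
                cause="Possible packet loss, congestion, or a slow receiver.",
                remedy="Check the devices along the path for drops or errors.",
            ))
        seen.add(key)
    return events


segments = [
    Segment("A->B", 1000, 500),
    Segment("A->B", 1500, 500),
    Segment("A->B", 1000, 500),  # same data sent again
]
events = find_retransmissions(segments)
for e in events:
    print(f"{e.name}: {e.description} Cause: {e.cause} Remedy: {e.remedy}")
```

Notice that the cause field can only list possibilities. The event reports that a segment was sent twice, which is the what; deciding which device along the path is responsible is still left to the analyst.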
For example, it should be easy to move seamlessly from experts to calls, from calls to flows, from flows to transactions, and from there to packets and payloads. Another integration is a dashboard view, as in ELK and Splunk, where different types of data are shown in separate widgets or windows, and selecting something in one filters the others. Finally, there is the concept of mashing as many different types of data as possible into a single graph.
Omnipeek does this in the Compass view, providing a single visualization to see relationships between different types of data quickly. At an even higher level, more advanced products like LiveNX take in both flow data generated from packets and information about the network devices themselves using SNMP. This allows for the correlation I was referring to earlier between the data on the wire and the devices generating and consuming the data.
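To illustrate the idea of correlating data on the wire with data about the devices, here is a minimal Python sketch. It does not reflect how LiveNX works internally; the record layouts, device names, and thresholds are all hypothetical. It simply lines up per-minute flow latency with per-minute device utilization (of the kind that might be polled via SNMP) and reports where the two are high at the same time.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flow records derived from packets: (minute, device, latency_ms)
flows = [
    (0, "switch-1", 12.0), (0, "switch-1", 15.0),
    (1, "switch-1", 95.0), (1, "switch-1", 110.0),
    (1, "router-1", 14.0),
]

# Hypothetical device samples: (minute, device) -> interface utilization in percent
utilization = {
    (0, "switch-1"): 35.0,
    (1, "switch-1"): 97.0,
    (1, "router-1"): 40.0,
}


def correlate(flows, utilization, latency_ms=50.0, util_pct=90.0):
    """Return buckets where high average latency coincides with high utilization.

    Each result is ((minute, device), average_latency_ms, utilization_pct).
    The thresholds are arbitrary illustrative defaults.
    """
    buckets = defaultdict(list)
    for minute, device, latency in flows:
        buckets[(minute, device)].append(latency)

    suspects = []
    for key, values in sorted(buckets.items()):
        avg = mean(values)
        util = utilization.get(key)
        if util is not None and avg >= latency_ms and util >= util_pct:
            suspects.append((key, avg, util))
    return suspects


suspects = correlate(flows, utilization)
for (minute, device), avg, util in suspects:
    print(f"minute {minute}: {device} avg latency {avg:.1f} ms at {util:.0f}% utilization")
```

A match like this is a clue, not a verdict. Two things being high in the same minute narrows down where to look, but someone still has to work out whether the busy device is the cause or just another symptom.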
How about another analogy: the good ol’ murder mystery. Here, the experts and other analyses are the clues, and the visualizations provide different ways to think about the analysis. However, it is still up to the investigator to figure out who did it. Maybe a stretch, but I think it works because in both cases, whether it is the investigator or the protocol analyst, they have plenty of tools and data.
However, they still have to rely on creative thinking, deductive reasoning, communications, and as much knowledge and experience as possible about the surrounding environment. In the case of the protocol analyst, that means operating networks themselves, running into all kinds of problems, and having to solve them. And the more of this they have, the more of the secret ingredient they bring to the table: intuition.
Working Smarter Together
And that, my friends, is why I know that the software alone is not going to solve your network problems. Even though I am a pretty smart guy, who has been developing network monitoring and troubleshooting software for a long time, I am not an IT person with years of experience solving network problems.
When I get on the phone to work with a customer, I can go on all day long about the features, but if we capture some traffic on their network, I am probably not going to be able to explain what the problem is. Having said that, if you do end up on the phone with me, which is likely, I will try, and I know a lot of intelligent people who can help.
By: Chris Bloom, Lead Technical Engineer