Art of Diagnostics
Modern IT support is commonly built on tried and tested processes and procedures, accrued over time from the experience of many people. This can be a blessing or a curse, depending on which side of the fence you sit. For inexperienced people it is a great help: they follow a guide from step one through to completion. It’s quick and simple, and the end customer receives a fast fix (in theory, anyway).
But life isn’t that simple. Issues will of course present themselves that are not common, documented or planned for. It will typically fall to someone to diagnose and find the fault. Those with little experience will look to those with more to find the answer.
In an ITIL orientated organisation, you’ll find teams of people dedicated to working on systems with their respective skill sets who will resolve their own issues. On paper this is ideal. Issues should be solved quickly, right?
Too many times now I’ve been passed a ticket where the customer reports no Internet. 1st and 2nd line have come back with no resolution. The broadband provider has reported there are no issues, so the ticket bounces between the network and server teams for a bit until both say it’s not their problem, and then it goes back to 2nd line. All the while, the customer just wants working Internet. Someone, somewhere, must resolve the issue themselves or prove the problem sits with a particular team.
Chances are, if you’re reading this, it’s left to you. The question is, how can you resolve issues in the most efficient way? There is no single answer of course, but there are several methods I’ve learnt over my many years in IT Support.
Step 1 – Understand the problem
This might seem obvious, but it’s often overlooked. It’s not sufficient to simply say the problem is the Internet is not working, or an application keeps crashing. That’s the outcome of the issue. Look deeper. It isn’t enough to just read a ticket in a queue - all you are really doing is reading someone’s interpretation of the problem.
I know it can be considered a sin in a modern IT support function, but maybe it’s quicker (and easier) to speak to the customer directly. Ask them to show you the problem. They will be grateful for the direct support from someone who can fix the issue.
Step 2 – Understand the technology behind the problem
Whatever the system is, it’s based on an underlying technology. Be that IPv4, SQL or Microsoft Windows itself, there are many layers. In the case of a connectivity problem, work out its connection method. Is it via a wired network, Wi-Fi or even via the web? If it is, consider how networks work. The OSI model is important. Rule out the physical layer first. A simple test is to try another piece of network software that reaches beyond your problematic server. If it’s a web app, try accessing a simple news website. If that works, you’ve ruled out the physical layer instantly. So now work your way through the other OSI layers.
Does the network traffic need routing, or is it on the local LAN? Test something similar if you can. Even a simple ping can help you. Ping uses ICMP, so it tests the Network layer; opening a connection to the service’s TCP port tests the Transport layer as well. At this point, I would also rule out firewalls.
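As a rough sketch of going one step further than a ping, the following Python checks whether a TCP connection to a given host and port can be opened at all. The function name and the three-second timeout are my own choices for illustration. A failure here doesn't tell you why (routing, a firewall and a stopped service all look similar), but a success tells you the path to that port is open.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `can_connect("intranet.example.local", 443)` (a made-up host name) returning False while the same call against a public website returns True would point away from the local network and towards the route, firewall or server in between.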
If you’re still stuck, you will need to move into the more software-orientated layers. This includes sockets, which are fairly low-level software and OS functions. There are numerous tools to help diagnose these issues, netstat being the obvious one. But there are external tools as well, including PsTools from Mark Russinovich. Completely free, and extremely powerful and useful.
It’s ok to install and use 3rd party tools to diagnose issues. There is no rule that says this is a bad idea if they are from a reputable source.
Side Note – PsTools
PsTools is a great example of completely free software, provided with the genuine aim of helping the Windows community. The suite of tools is almost legendary now and includes command line utilities to see which files are open and to monitor open network ports, processes and DLLs. It is a must-have suite of tools for any advanced diagnostics. Oh, and it is entirely free. (See: PsTools - Windows Sysinternals | Microsoft Docs)
These become more useful as you move up the OSI model, especially at the Application layer.
In the case of network connectivity, a common fault tends to be DNS, but it’s often overlooked. Is the DNS name being resolved to the correct IP? This can be spotted using the OSI model as described above. Yes, of course, right now that seems a really long way to resolve the issue, and typically experienced engineers will look for this within the first few diagnostic steps. But this isn’t necessarily taught to the up-and-coming IT Support Engineers.
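The DNS question can be checked in a few lines. This is a minimal sketch using Python's standard resolver call; the function names are mine, and the IP you compare against is whatever you know the server's address should be.

```python
import socket
from typing import Set


def resolve(hostname: str) -> Set[str]:
    """Return every IP address the local resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The name did not resolve at all.
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address string is the first item of sockaddr.
    return {info[4][0] for info in infos}


def resolves_to(hostname: str, expected_ip: str) -> bool:
    """Return True if expected_ip is among the addresses for hostname."""
    return expected_ip in resolve(hostname)
```

An empty result means the name isn't resolving at all. A result that doesn't contain the expected address suggests a stale record, a hosts file entry or the wrong DNS server being queried.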
The OSI model is just one example which could be used to help diagnose connectivity issues.
The key point is to understand the technologies being used. This will inform the structure of your investigation.
Step 3 – Logs mean patterns
Any good software will provide the ability to log. Some logs are generated automatically and some need enabling manually. RTFM is the best advice here. Speak to the supplier or Google it. Logs might just give you the answer or provide more information as to where the problem is.
Don’t forget Windows Event Viewer. If it’s a mainstream software vendor, they might not log to a text file but instead use the Event Log.
Use whatever text logs you can. Some are unbelievably difficult to navigate, so download a good log viewer application. I use LogExpert. It’s free and open source on GitHub (Source: GitHub - zarunbal/LogExpert: Windows tail program and log file analyzer / Download: Release 1.8.7 · zarunbal/LogExpert · GitHub). It is incredibly powerful and well written. It can highlight keywords such as ERROR, which is remarkably useful when you’re sifting through a 2 MB text file.
Once you work through the log and find an issue, see if you can spot patterns. If a problem repeats randomly, use the logs to get the times and dates. You might find these random events occur at specific or predictable times. Evaluate external causes. Poorly written applications may crash if there is a performance hit either locally or on the network. The logs may show it occurred at a time when an antivirus scan triggers, or perhaps when a scheduled task starts.
You’ll notice all the steps so far are just working out the problem. That’s 90% of problem solving. The easy part, fixing it, is just 10%.
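If the log is too big to eyeball, a few lines of script can do the pattern spotting for you. The sketch below counts lines containing a keyword by hour of the day. It assumes each line starts with a timestamp in the form YYYY-MM-DD HH:MM:SS, which is purely an assumption for illustration; real logs vary, so adjust the pattern to match yours.

```python
import re
from collections import Counter
from datetime import datetime
from typing import Iterable

# Assumed log format: each line begins "YYYY-MM-DD HH:MM:SS".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def errors_by_hour(lines: Iterable[str], keyword: str = "ERROR") -> Counter:
    """Count lines containing keyword, grouped by hour of day (0-23)."""
    counts: Counter = Counter()
    for line in lines:
        if keyword not in line:
            continue
        match = TIMESTAMP.match(line)
        if not match:
            continue  # Skip lines without a recognisable timestamp.
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        counts[stamp.hour] += 1
    return counts
```

Feed it an open log file, for example `errors_by_hour(open("app.log", encoding="utf-8"))` with your own file name. If almost every error lands in the same hour, compare that hour against the antivirus schedule and Task Scheduler.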
Step 4 – Think outside the box
On the face of it, a problem can seem obvious yet make no sense. Consider this example. Application A crashes on start-up. It’s not a mainstream product and is something written specifically for a business or industry. On start-up it shows the message “Object reference not set to an instance of an object” and then exits immediately when the prompt is closed.
This message is clearly unhandled by the developers (which is not necessarily a fault of the developer, by the way) and is the sort of error that will make developers and support colleagues groan, because it’s a useless message to show an end user. Having said that, it has a meaning, but it’s directly related to the work the software was doing at the time, which isn’t helpful to us.
Assuming you’ve followed Step 1, and seen the problem for yourself, you’d move onto Step 2. The error happens at application start-up. You can safely assume it’s occurring when initiating something behind the scenes.
If it’s a network application, does this problem occur on other devices? Is it unique to this device?
Does it use an SQL database? If so, a simple technique to test SQL connectivity is to test the connection from Data Sources (ODBC). But of course, if you have a connection string, just try and ping the server it names. A successful reply won’t prove what the problem is, but it can rule out the Network layer, so the fault is likely to be a software issue.
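If you have the connection string to hand, a small sketch like the one below pulls out the server and port and tries a TCP connection, which tells you a little more than a ping. It assumes a Microsoft SQL Server style string (a `Server=` or `Data Source=` key, with an optional `,port` suffix) and falls back to port 1433. The function names are mine, and named instances on dynamic ports won't necessarily be listening on that default.

```python
import socket
from typing import Tuple

DEFAULT_SQL_SERVER_PORT = 1433  # Default port for a SQL Server default instance.


def server_from_connection_string(conn_str: str) -> Tuple[str, int]:
    """Extract (host, port) from a SQL Server style connection string."""
    for part in conn_str.split(";"):
        key, _, value = part.partition("=")
        if key.strip().lower() in ("server", "data source"):
            host, _, port = value.strip().partition(",")
            # Drop any named instance suffix, e.g. HOST\INSTANCE.
            host = host.split("\\")[0].strip()
            port_number = int(port) if port.strip() else DEFAULT_SQL_SERVER_PORT
            return host, port_number
    raise ValueError("no Server or Data Source key found")


def sql_port_reachable(conn_str: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the database server can be opened."""
    host, port = server_from_connection_string(conn_str)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A True result here, alongside a crashing application, pushes the investigation up the stack towards credentials, drivers or the application itself.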
If SQL is all good, what else does the application do when it starts? If the application has no logs and no option to enable logging, check the event log. Sometimes the root cause will be logged, especially if it’s a DLL error.
You might find Event Viewer is showing a spooler error and this happens to trigger when the application crashes. Now we have a route to investigate. The Spooler is the print subsystem within Windows. Have a look at the printers. Try and view the properties of each printer. If there is a driver issue, you’ll find out quickly.
In this scenario, all printers responded without issue. Try printing from another application.
Either the spooler error is a coincidence, or it remains the cause of the problem.
Before you continue reading, would you have given up by now? Perhaps uninstalled or reinstalled the problematic software? Or gone as far as reinstalling Windows?
Experience tells me this cannot be a coincidence, so it’s time to start thinking outside the box and start asking questions.
Why is this application causing the spooler to throw an error? Why is this application doing anything with printers during start-up? Let’s have a look at the configuration of the application. Have a look for any settings files. Based on the message, it’s probable this application is written in .NET. Settings can be stored anywhere on the system where the user has write permissions. If the application requires elevation, it could be the settings are stored next to the application executable (naughty Devs…). Otherwise, it’s likely to be in the user’s profile somewhere.
In this scenario, I found an XML configuration file, and one particular setting was called “Default” and included a UNC path to a printer. Can you print to that printer? You may have guessed, but this actually happened to me, and you might be surprised to learn that the printer didn’t exist. It had been removed. When the application was starting, it was trying to connect to the printer, but it didn’t exist, and that was causing the error in question. By removing the dead printer from the XML, the application started and immediately defaulted to the system default printer. Error gone.
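When a settings file is large, hunting for network paths by eye is slow. As a rough sketch, the following Python lists every element in an XML file whose text or attribute value looks like a UNC path. The function name and the sample element names are hypothetical; the real file's layout will depend on the application.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Matches values that start with \\server\share.
UNC_PATTERN = re.compile(r"^\\\\[^\\/]+\\[^\\/]+")


def find_unc_paths(xml_text: str) -> List[Tuple[str, str]]:
    """Return (element tag, value) pairs for every UNC path in the XML."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        # Check both the element's text and all of its attribute values.
        candidates = [element.text or ""] + list(element.attrib.values())
        for value in candidates:
            value = value.strip()
            if UNC_PATTERN.match(value):
                found.append((element.tag, value))
    return found
```

Each path it returns is something the application may try to reach at start-up, so each one is worth checking by hand: does that server, share or printer still exist?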
Question is, did you immediately think of printers when you saw that error message?