God Help the Help Desk



By Manish Chacko

This is a story about a typical help desk in a large organization in the continental United States. This is probably true for other parts of the world as well, but I'll refrain from making claims I cannot possibly back up.

With no standard framework for application delivery in today's IT environments, generally speaking, the current corporate rules for IT troubleshooting are:

Step 1: Wait for someone to call the help desk.

There really is no Step 2. If nobody calls in, surely everything is going well, right? Huge corporations invest millions of dollars in hiring people to staff help desks, purchasing software to run help desks, and running meaningless reports. Yet, they leave the important troubleshooting to the end user, an end user who doesn't even realize that he's performing that task! End users who don't even know how to spell GUI, much less execute a complex diagnostic procedure.

Of course, the end user isn't exactly going to be responding with scientific measurements used to measure the performance of mission critical applications. He's going to say, "The network is acting slow," or the even more vague, "The network is acting weird."

Let's look at two different scenarios.

Scenario I

The user calls and complains that (yes, you guessed it) "the network is slow." The supervisor then asks: "How many people called in? Just one? Okay then, ignore him and run the usual troubleshooting routines."

And of course, so many troubleshooting tickets come in every day that they can't begin to cover all of them.

This is the point where the conversation with the end-user becomes: "Have you turned your computer off and on?"

A few minutes later, the ticketing system alerts the supervisor that there have been several trouble tickets with the same complaint from the same regional office.

Hmmm...

The supervisor then notifies the help desk manager, who then runs reports to see if this is historically predictable behavior for the third Wednesday in July of a non-leap year, with Venus approaching Jupiter and Mars in retrograde...

Once the help desk manager shakes his magic 8-ball and determines that it's not normal behavior, he gets the Network Operations Center involved. The NOC wants information: number of users, symptoms, troubleshooting steps tried, so a second call back to the user is required, which prompts another round of "Have you turned your computer off and on?"

Once the proper troubleshooting time has elapsed, they pass the ticket along to the network engineers. Now the network engineering team, which was working on new engineering projects (MPLS deployment, router management, VoIP, etc.) that cost hundreds of thousands of dollars, is troubleshooting what seems like a network issue.

After spending some time, they discover that the link is fine, the router is good, and this is not a network issue. The trouble ticket is passed back to the NOC, who, no closer to knowing what the problem actually is, decide to call on the server team, because they're usually next.

The server team, unable to find the problem, points the finger back at the network team, and then the network and server teams meet and scapegoat another team.

You get the idea. Instead of quickly detecting a deviation from normal and addressing it immediately, or better yet proactively managing IT resources, companies wait for a few end-users to call and identify which IT resources are flaky.

Imagine a man walking into a hospital, saying that he doesn't feel good, and doctors around the country are immediately called in, starting with the cardiologist, who rules out heart trouble. The man is next wheeled to a podiatrist, who rules out any problems with his feet. He's then wheeled to a gynecologist. (But I'm a man... Ma'am, I'm a doctor. I think I should make that determination - and only after the tests come back.) If your diagnostic process is trial and error, you're not, technically, diagnosing.

In any corporation, many end-users know only the programs they need to use for their jobs, not general computer troubleshooting knowledge, let alone the ability to pinpoint issues with IT resources that involve the network, server, applications, etc. Yet, these are the people that the corporations rely on heavily for diagnostics, and if the only answer that can be coaxed out of them is "the network is slow," the solution will probably be upgrading the network link at great expense - even if it turns out the network isn't the cause of the problem.

Scenario II

An end-user calls the help desk and complains that his "Internet is slow." The help desk asks him if anyone else in the office has the issue. If the answer is no, the rebooting dance is performed, and if that doesn't solve the problem, it is dismissed with a "Well, we'll look into it, thank you."

The next day, the same user calls in with the same problem, and gets the same response. Nothing is done to fix his problem.

By the end of the third day, the employee thinks: "Why am I even trying? The guys in IT don't do anything, and I'm not going to waste my time listening to crappy hold music only to be informed that 'we'll look into it.' There comes a time in a person's life that they just can't take any more 'The Girl from Ipanema,' dammit!"

But instead of taking to the streets yelling "Hey Hey, Ho Ho, this Bossa Nova's got to go," they either find an alternative (and likely harder) solution to get their work done, or just do less work. The result? Disgruntled, unmotivated employees doing sub-standard work. But it's not their fault. They did their due diligence and called the help desk. The help desk followed standard corporate operating procedure. So did everyone else.

The point is this: The performance of an application, server, network, or any other IT resource should ultimately be judged by the end-user experience. Internal employees are internal customers of your IT resources, and they are just as important as external customers.

What exactly are companies trying to achieve by buying DS3s and putting in a half dozen CPUs and several gigabytes and terabytes and ninjabytes of hard drive space on servers and other network devices? Simple: They want the best IT resources available so that the employees can do their jobs efficiently and effectively.

This, in turn, improves business performance, which is tied to revenue. The faster and better your company can manufacture and ship those widgets, the more satisfied your customer is, which brings in revenue faster and maintains good customer relations.

And to achieve this good relationship with your external customers, the company as a whole must be efficient (unhappy internal customers simply aren't an option).

Managing IT resources is a critical process that is often neglected - and it's not lack of money or human resources that's causing the neglect. It's caused by huge nameless, faceless companies trying to get the most out of their resources without spending a lot on IT troubleshooting. That's sad, because ultimately, satisfied employees and an optimally performing network will make more money than scrimping on IT will save.

I can't help but think that these have become the standard business practices because younger companies merely mimic the older companies - who themselves have no idea what to do, but have been in the industry a long time. This is the blind leading the blind.

Solution

So, let's look at another situation. This time, the first to be notified of a problem is not the end-user, when his console stops performing optimally, but the NOC, where a screen flashes an alert that server X in city Y has degraded performance, giving out complete metrics instead of red and green idiot lights.
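For the curious, here is a minimal sketch of how such an alert might decide that performance has "degraded." It assumes "normal" simply means the average of recent response-time samples, and "degraded" means a reading well above that average. The function names, the three-sigma threshold, and the sample numbers are all made up for illustration; no particular monitoring product is implied.

```python
from statistics import mean, stdev

def is_degraded(history_ms, current_ms, sigmas=3.0):
    """Return True when current_ms is more than `sigmas` standard
    deviations above the mean of the recent samples in history_ms."""
    if len(history_ms) < 2:
        return False  # not enough data to say what "normal" is
    baseline = mean(history_ms)
    spread = stdev(history_ms)
    return current_ms > baseline + sigmas * spread

def format_alert(server, city, history_ms, current_ms):
    """Build an alert line that carries real numbers, not a red light."""
    baseline = mean(history_ms)
    return (f"ALERT: {server} in {city} response time {current_ms:.0f} ms "
            f"(baseline {baseline:.0f} ms)")

# Hypothetical samples: recent response times in milliseconds.
recent = [100, 110, 90, 105, 95]
if is_degraded(recent, 450):
    print(format_alert("server X", "city Y", recent, 450))
```

The point of the sketch is only that the comparison against normal happens before anyone picks up the phone, and that the alert carries the measurements with it.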

The NOC clicks on the alert and is directed to a central event management portal. Here, with the help of different modules, such as application and server performance tools, NetFlow metrics for capacity planning, and device management software, they have a wide variety of options.

The help desk or NOC technician then clicks on the device management module and sees the device is still running and is not maxing out CPU or memory, which means the data center team and the server management team can stay out of this. Then the tech clicks on the application and performance monitoring tool and notices the server response time is very high, but by checking the NetFlow data, the tech determines the link is not heavily utilized - which means the network team can stay home. Immediately, the application team is alerted to this issue because neither the network nor the server seems to be causing the problem, and the application seems to be the bottleneck.
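The triage the technician just walked through can be written down as a short decision routine. What follows is a rough sketch under stated assumptions: the `Metrics` fields, the threshold values, and the team names are all hypothetical, and a real portal would pull these readings from its own monitoring modules.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    device_up: bool       # from the device management module
    cpu_pct: float        # CPU utilization, percent
    mem_pct: float        # memory utilization, percent
    response_ms: float    # server response time, milliseconds
    link_util_pct: float  # link utilization from flow data, percent

# Hypothetical thresholds; each shop would set its own.
CPU_MAX = 90.0
MEM_MAX = 90.0
RESPONSE_MAX_MS = 500.0
LINK_MAX = 80.0

def route_ticket(m: Metrics) -> str:
    """Pick the one team to call, in the order the technician checked."""
    # Step 1: is the box itself down or maxed out?
    if not m.device_up or m.cpu_pct >= CPU_MAX or m.mem_pct >= MEM_MAX:
        return "server team"
    # Step 2: is the response time actually high?
    if m.response_ms <= RESPONSE_MAX_MS:
        return "no action"
    # Step 3: is the link saturated?
    if m.link_util_pct >= LINK_MAX:
        return "network team"
    # Device and link both look healthy, yet responses are slow.
    return "application team"

# The situation described above: healthy box, quiet link, slow responses.
print(route_ticket(Metrics(True, 35.0, 40.0, 2400.0, 20.0)))
```

Each check rules a team out before anyone is paged, so only one team gets the call.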

There are a lot of permutations and combinations that this issue can take, of course, but the idea is that it's always best for the NOC or help desk technician to call out the right people for the right job at the right time.

Instead of getting involved hours after the incident or call, the application team is on it within the hour, if not within minutes. (Heck, if you install one of those fire-house poles, you could probably get them on it in seconds...) Meanwhile, the network engineering and server teams haven't been distracted from their other projects chasing down phantom network/server issues.

The end-user, who is (usually) the most ill-equipped person in the IT chain to determine the root cause of this crisis, is not being asked to provide a detailed troubleshooting analysis; there is no need for complex astrology in determining whether a situation is normal; and the girl from Ipanema just goes walking...

----

Manish Chacko is the manager of systems engineering at NetQoS.



