Zach Belcher, Technical Consultant and Product Manager for NetQoS/CA, discusses how to detect performance degradation early and use automated workflows to help find and fix problems before the end user notices.
Application Performance Archives
Brian Bakstran, VP of Product Marketing at our parent company, CA, recently blogged about a study from Network Instruments which talks about how 59% of IT organizations “lack the experience to manage virtualized environments effectively.”
Combine that with the prediction that, by 2012, 80% of all new servers will be virtual, and you start to get a sinking feeling that the entire IT industry knows where it’s going but hasn’t really thought about what it needs to do once it gets there… sort of like sitting in the first four rows at Sea World, all excited to see Shamu, but forgetting to pack a poncho.
Managing virtualized environments effectively starts with visibility into them, and that’s what vendors like us and our parent company offer. (In CA’s case, for right now, we’re offering it in spades, with the NetQoS stuff, the e-Health stuff, and CA Virtualization Management.)
The main concern this lack of visibility raises for enterprise IT shops is that mission-critical applications that performed fine before virtualization may perform poorly once virtualized, and that the IT shop will have no way to find performance problems proactively, nor the tools it needs to quickly find their root cause.
And visibility is necessary even before virtualization, so that you have a non-virtualized baseline to compare performance against. Some applications will simply always perform poorly when virtualized, and the sooner those applications are discovered, the better. Knowing what does and does not work in virtualized environments gives you options: you can replace the app, run the app on a dedicated server, or even recode it to work better in virtualized environments. But without visibility, you have no options.
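One simple way to use that baseline is to compare a response-time percentile for the same application before and after it moves to a virtual machine. The Python sketch below does that with the nearest-rank 95th percentile; the sample data and the 20% tolerance are hypothetical, and in practice the measurements would come from a monitoring tool rather than hard-coded lists.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

def compare_to_baseline(baseline_ms, virtualized_ms, pct=95, tolerance=0.20):
    """Compare a response-time percentile before and after virtualization.

    Returns (baseline value, virtualized value, degraded), where degraded is
    True if the virtualized value exceeds the baseline by more than
    `tolerance` (0.20 = 20%, an arbitrary threshold chosen for illustration)."""
    base = percentile(baseline_ms, pct)
    virt = percentile(virtualized_ms, pct)
    return base, virt, virt > base * (1 + tolerance)

# Hypothetical response times (ms) for one application, measured on its
# physical server and again after the move to a virtual machine.
physical = [110, 120, 115, 130, 125, 118, 122, 140, 119, 121]
virtual = [150, 160, 155, 190, 170, 158, 165, 210, 162, 168]

base, virt, degraded = compare_to_baseline(physical, virtual)
print(f"p95 before: {base} ms, after: {virt} ms, degraded: {degraded}")
```

Using a high percentile rather than an average keeps occasional slow responses, which are what users actually notice, from being smoothed away.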
Between reduced energy consumption and better utilization of existing servers, the benefits of virtualization are worth the risk, but nothing says you can’t bring in every tool you can to gain visibility into your virtualized servers and mitigate that risk.
In May of this year, Nemertes Research president Johna Till Johnson wrote in Network World that “The Internet Sky Really Is Falling.”
The next day, we came out with a story about that column, in our much more irreverent style, entitled “That’s great, it starts with an earthquake: Is the Internet dying?”
In that article, we questioned the conclusions Nemertes drew from its evidence. To sum up:
Nemertes believed that YouTube restricting high-definition video in developing countries was a sign of Internet demand outstripping backbone capacity. We pointed out that those restrictions were due to local traffic problems and the lack of profitable business models in many developing markets.
Nemertes also pointed out that many cable carriers were instituting bandwidth caps and pay-per-byte pricing. We pointed out that we had done an entire series on why usage caps don’t help with traffic congestion; that ISPs that roll them out typically do so in non-competitive markets where they have other business interests (like cable TV and phone service) that compete with Internet access; and that there were plenty of counter-examples of companies (like Verizon and Cablevision) offering more bandwidth without caps.
And Nemertes pointed out the IPv4 address shortage, for which there is already a solution: IPv6. (Though adoption has been slow, that does not mean the Internet will halt, simply that the IPv6 changeover will become more expensive the longer it is delayed.)
But the one thing we didn’t question was Nemertes’ claim that Internet traffic will grow “exponentially” while Internet backbone capacity will grow “linearly,” which led Nemertes to conclude that there will come a day of Internet “brownouts.”
Recently, Johna Till Johnson published another column – this time in ComputerWorld, outright claiming that net neutrality legislation would mean the end of the Internet. That’s not hyperbole on my part – the headline is literally: “Hello net neutrality, goodbye Internet.”
And Ars Technica, a Conde Nast publication, decided to take another look at Nemertes’ evidence.
Essentially, Nemertes now claims (in the October article) that Internet growth strains last-mile access lines (Cable/DSL/FiOS), which are “excruciatingly expensive to upgrade,” and that because network neutrality would prevent carriers from charging different rates for different traffic, backbone providers and carriers would start charging by the bit, or at least capping usage and charging for overages. Since bandwidth providers would then charge each other for the traffic on their networks, they would either raise subscriber rates dramatically or disconnect from the Internet entirely, literally killing the Internet as it breaks down into walled gardens like the early-1990s CompuServe, AOL, and Prodigy.
Ars Technica, on the other hand, points out that the “excruciatingly expensive” last-mile upgrades aren’t all that expensive compared to the profits that Internet service providers already generate under net neutrality and, in most cases, without caps. Verizon, for example, is spending $18 billion on FiOS upgrades; that’s the most expensive upgrade in the market, and Verizon still finds it financially feasible in a net-neutral market. For most ISPs, DOCSIS 3.0 (for cable) and FTTN (for DSL) are very cheap ways to increase last-mile bandwidth.
As for the idea of the Internet fracturing, Ars Technica pointed out that ISP networks all exchange roughly the same amount of traffic with each other, and an even trade is an even trade no matter how much it costs. There are many ways to recoup costs, but raising the rates on a competitor who can then turn around and raise rates on you doesn’t make any sense at all.
Or as Sevcik and Wetzel put it in Network World:
“Backbone ISPs and access ISPs must play nicely with each other to satisfy their customers' needs. Why for heaven's sake would they hurt their customers and themselves by balkanizing?”
What’s most worrying, however, is that Ars Technica found Nemertes’ central idea, that Internet growth is outstripping capacity, may be flawed.
According to the University of Minnesota’s MINTS project, the year-over-year growth of Internet traffic is not “50-100%,” as Nemertes claimed in the ComputerWorld article, but “50-60%.” (Technically, “50-60%” falls within “50-100%,” but that’s like saying a man who is 5 to 6 feet tall is “between 5 and 10 feet tall.”) In Canada, where ISPs have to disclose traffic numbers because of the Canadian government’s network neutrality research, growth is slowing year over year: 53% in 2006, but 32% in 2008.
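The difference between those ranges matters more than it looks, because annual growth compounds. A quick Python sketch using the growth rates quoted above (the three-year horizon is an arbitrary choice for illustration):

```python
import math

def traffic_multiple(annual_growth, years):
    """Traffic after `years`, relative to today, at a constant
    year-over-year growth rate (0.5 means 50% per year)."""
    return (1 + annual_growth) ** years

def doubling_time(annual_growth):
    """Years for traffic to double at a constant growth rate."""
    return math.log(2) / math.log(1 + annual_growth)

for rate in (0.50, 0.60, 1.00):
    print(f"{rate:.0%}/yr: {traffic_multiple(rate, 3):.1f}x in 3 years, "
          f"doubles every {doubling_time(rate):.1f} years")
```

At 50%, 60%, and 100% per year, traffic grows roughly 3.4x, 4.1x, and 8x over three years, so the top of Nemertes’ range implies nearly twice the traffic of the top of the MINTS estimate.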
We’ve found that when it comes to enterprise networks and IT in general, Nemertes Research is a valuable research organization. But in 2007, Nemertes made a prediction about the Internet (a reasonable one, given the evidence at the time) that did not come to pass. Instead of re-examining that prediction, they continue to insist, with openly contested arguments, that they were right all along, even though, less than 3 months before the ominous “2010” date, the Internet has managed to keep up with the demand of high-bandwidth YouTube HD files, Netflix streaming, Skype video calling, video game downloads, and other high-throughput applications.
I think that what is actually happening, rather than demand for bandwidth outstripping supply, is that the supply of bandwidth creates its own demand, and that the new demand comes primarily from new applications. That is, HD video on the net is only in demand now that the networks have been shown to be able to handle that kind of capacity. YouTube didn’t start until there was enough capacity on the Internet to make SD video distribution feasible. Only later, when capacity grew, did YouTube roll out high quality video, and still later (after Vimeo proved it was feasible) did YouTube roll out 720p video content. When the network capacity can handle streaming 1080p video, then that will be the new standard. But no one is going to roll out 1080p video until the network can handle it.
This is not to be confused with the issues enterprises face when trying to prioritize business-critical traffic over recreational traffic, where the supply of recreational traffic can be artificially restricted through QoS policies and traffic shaping, presumably to lower the strain it puts on the network. Even so, most smart companies engage in capacity planning, making sure they have the bandwidth to support new applications before rolling them out. Teleconferencing, for example, is a business application that requires a great deal of bandwidth, but it’s of no use to an organization (and therefore not in demand) if the company network can’t support it. In other words, if the money saved from teleconferencing isn’t equal to or greater than the increase in network costs, smart companies are not likely to invest in it.
In short, the sky is not falling. But keep an eye on your patch of it, anyway.
Virtualization is a good news/bad news technology. The good news is that you can consolidate your servers onto one piece of hardware; the bad news is that you lose visibility into the overall network. Jim Metzler, of Ashton, Metzler & Associates, and Ben Erwin of NetQoS discuss how to preserve visibility into application delivery in this short Whiteboard Series video.
WAN Optimization solutions, assuming they work for the applications you need them to work for, are like magic. Consolidating data centers, from a relativistic standpoint, actually moves users further away, so to consolidate data centers and lower costs, WAN performance needs to be good enough for remote users to do their jobs.
But the irony is that as data centers become more consolidated, users are becoming less consolidated. More people are telecommuting than ever before. (Even if the number of full-time telecommuters has gone down, the number of part-time telecommuters is rising.) It makes a certain amount of sense: an employee too sick to come into work (and infect others) but not too sick to work might work from home, or sales teams might file reports from the road.
This creates a problem for most WAN Optimization solutions, because most require appliances at both ends of the WAN link, and telecommuters usually access applications over the public Internet. Software-based WAN optimization controllers (“Soft WOCs”) can do some of the work, but telecommuting requires high-performing broadband as well as optimization.
Soft WOCs work essentially by recreating, in software on the mobile computer, a lightweight version of the appliance that normally sits at the remote end of an optimized WAN link. The Soft WOC then optimizes the stream between the telecommuter’s computer and the data center.
The problem is that WAN optimization is less efficient with a single user than with multiple users sharing the same link. First, when multiple users access the same data, you can take advantage of caching; on a Soft WOC link, caching only helps if the same user accesses the same data twice.
Second, on a normal optimized WAN link there is only one TCP stream to worry about, the optimized one, with individual streams recreated only at the two ends of the transaction. Each Soft WOC essentially creates its own stream. For that reason, telecommuting solutions simply aren’t going to give you the same dramatic performance increase you’d get from more traditional WAN Optimization.
On the other hand, any improvement is still improvement. Just be sure to baseline your performance and see if the value is there before deploying Soft WOC solutions.
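To make the caching point concrete, here is a minimal Python sketch that counts cache hits for the same hypothetical set of requests, once with a single cache shared by every user (as at a branch-office appliance) and once with a separate cache per user (as with a Soft WOC on each laptop). It models caching at the level of whole files with no eviction; real WAN optimization products typically cache and deduplicate at the byte or block level, so treat it as an illustration of the sharing effect, not of any product.

```python
def cache_hits(requests, shared):
    """Count cache hits for a list of (user, file) requests.

    shared=True models one cache serving every user (a branch appliance);
    shared=False models a separate cache per user (a Soft WOC per laptop).
    Caches are unbounded, with no eviction, to keep the sketch simple."""
    seen = set()
    hits = 0
    for user, name in requests:
        key = name if shared else (user, name)
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits

# Hypothetical workload: three users each open the same two files once.
requests = [(user, name)
            for user in ("alice", "bob", "carol")
            for name in ("report.xls", "deck.ppt")]

print("shared cache hits:  ", cache_hits(requests, shared=True))   # 4
print("per-user cache hits:", cache_hits(requests, shared=False))  # 0
```

With the shared cache, only the first request for each file misses; with per-user caches, every request is a first request.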
There’s an article on ZDNet talking about a video where Sun Microsystems CTO Lew Tucker talks about how future cloud computing applications will be able to know exactly how much demand there is for the application, and requisition the appropriate amount of computing power. During high demand, the application could grab more resources, preventing application-based slowdowns, and during low demand, the application could release resources back into the cloud, saving the company money.
Of course, ZDNet’s title for the article is “Future Cloud Apps won’t need humans” which conjures up frightening images.
If it’s any indication, dynamic allocation of the needs of information will cause anxious consternation about the continued necessitation of the IT occupation, and frantic desperation. (Of course, that’s just idle speculation.)
But it might be more accurate to say that “Future Cloud Apps won’t need humans to babysit them.” All Tucker describes is taking what used to be a manual process, deciding how much processing power a particular application needs, and having the computer make that determination on the fly based on actual demand. Humans will certainly still be involved in determining how much power is “too much,” how much slowdown is “acceptable,” and, most importantly, how much performance the end users can actually use.
This has two main impacts on the networking side of IT. First, if an application can dynamically allocate more resources during times of excess need, its performance may still be limited by the server or the network, but one of the main causes of application performance problems, not assigning enough resources to the application, goes away.
Application performance also starts to matter independently of the network, since a poorly coded application may need more resources and therefore cost more to operate.
Second, when you essentially remove the limits on application performance by simply giving it enough resources to do the job at any time of day, you have to keep looking for other bottlenecks. If you have the capacity to do more with what you’ve got, it makes sense to do everything you can to take advantage of that capacity.
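The kind of automation Tucker describes is often implemented as a threshold rule. The Python sketch below is a minimal, hypothetical version: the `ScalingPolicy` fields, the utilization thresholds, and the one-instance-at-a-time step are assumptions for illustration, not Sun’s design or any cloud provider’s API. Note that the limits themselves (minimum, maximum, and what counts as busy) are still set by people.

```python
from dataclasses import dataclass

@dataclass
class ScalingPolicy:
    """Limits chosen by people; the values used below are hypothetical."""
    min_instances: int
    max_instances: int    # how much is "too much"
    scale_up_at: float    # average utilization (0-1) that triggers adding capacity
    scale_down_at: float  # average utilization below which capacity is released

def desired_instances(current, avg_utilization, policy):
    """Return the instance count for the next interval, one step at a time."""
    if avg_utilization > policy.scale_up_at:
        target = current + 1   # high demand: grab more resources
    elif avg_utilization < policy.scale_down_at:
        target = current - 1   # low demand: release resources back to the cloud
    else:
        target = current
    # Never go outside the limits that people, not the application, decide.
    return max(policy.min_instances, min(policy.max_instances, target))

policy = ScalingPolicy(min_instances=2, max_instances=10, scale_up_at=0.75, scale_down_at=0.30)
print(desired_instances(4, 0.90, policy))   # 5: busy, add an instance
print(desired_instances(4, 0.10, policy))   # 3: quiet, release an instance
print(desired_instances(10, 0.95, policy))  # 10: already at the cap
```

Keeping a gap between the scale-up and scale-down thresholds avoids flapping back and forth when utilization hovers near a single cutoff.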
Now, before this possibility becomes a reality, cloud computing standards need to be developed, agreed upon, and used so that multiple applications can cooperate in any dynamically scaling environment. That may be very soon, or a long way off, but it will probably happen, because there’s simply too much money to be lost without a cloud computing interoperability standard.
The Dow Jones Newswires report that Google will acquire On2 Technologies, a video compression company, for $106.5M worth of stock, presumably for its video site YouTube.
It’s an awfully big investment, a hefty 6.7x multiple of On2’s trailing-twelve-month (TTM) revenue and one of the highest multiples in tech over the last 18 months, for a site that is perpetually the butt of jokes about not being able to turn a profit. But there are a number of reasons it might be a smart move.
At its core, Google has always been about using the power of computing to make information searchable and organized. Video has a major limitation: unlike text, you cannot search it by keyword, only by ‘tags’ (self-reported information) or by context, in this case “links.” If 12 people link to a video with the word “Tango,” for example, chances are the video is about tango in some shape or form.
But Google is pretty good at finding ways around these limitations. Google 411 was a free service with a secondary function: it allowed Google to improve its voice recognition algorithms to the point where it could offer Google Voice. And if Google can offer Google Voice, which automatically transcribes voicemail messages into searchable text, it’s not much of a leap to transcribe the audio track of uploaded video into searchable text. That makes video more attractive to advertisers.
Where On2 fits in is its video codec, VP6, which is compatible with Flash video and provides roughly the same quality as the current standard, H.264, at the same bitrates (file sizes). However, decoding (playing) VP6 takes significantly less processing power than decoding H.264.
Obviously, this is an advantage for Google, which is developing its own “Google OS” for low-powered netbooks. Plus, an awful lot of slow computers are still in use.
But less obviously (and this is a guess), because VP6 takes less processing power to decode, complex computations like voice recognition could be done faster when decoding thousands of VP6 files at once than thousands of H.264 files. Even if the difference is on the order of microseconds per video, across the millions of videos on YouTube those microseconds add up quickly.
Perhaps YouTube is losing money, but that may be because Google is creating, essentially, a new application and trying to get the best performance out of it before trying to market it; increases in application performance can often offset hardware costs, power requirements, or bandwidth needs.
NetworksFirst.com has recently created an online “Impact of Network Downtime” calculator, which you can use to estimate how much money it would cost if your network went down. It makes a compelling case for fault management and worrying about outages.
However, the cost of poor application performance is harder to quantify – or at least, requires more sophisticated tools and data - than the cost of fault. That may be the reason that many companies still consider fault management, and not performance management, to be the core responsibility of the IT team. Our most recent research conducted with Ashton, Metzler & Associates bears this out:
Fifty percent of respondents indicated that they measure and report on the mean time to repair (MTTR) for a network or application outage. However, only thirty percent confirmed they actually measure and report on the MTTR for degraded application performance, revealing a continuing legacy of fault and availability management over performance management.
As technology has improved, fault and availability problems have, for the most part, been solved. It’s no longer a distinguishing feature for a network service provider to promise 99.999% uptime. The next big challenge is maintaining good performance throughout the network.
But in many ways it’s a hard sell, because unlike with a fault cost calculator, it’s difficult to show exactly why you need performance management tools until you have a more nuanced calculation of what poor performance costs your business. What’s the difference in employee productivity when an application is 10% slower? 50% slower?
These types of metrics have typically been calculated for customer facing applications like Web retailers, but getting the data for internal IT users has been far less popular since it’s considered a soft cost in some arenas. But it really starts to add up if you pay attention.
One NetQoS customer said their typical critical business application “brownout” (before deploying NetQoS products) cost them $6000 per hour and they had about 20 of these per year, each taking about six hours to isolate and resolve. That’s $720k gone per year due to poor application performance ($6k * 6 hours * 20 events = $720k/year). True, the brownout costs less per hour than most estimates you see for out-and-out downtime, but they occur a lot more frequently.
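The customer’s arithmetic is easy to reproduce, and writing it as a function makes it simple to ask what-if questions. In the Python sketch below, the first call uses the figures from the example; the second assumes, purely hypothetically, that faster isolation cuts each event to two hours.

```python
def annual_brownout_cost(cost_per_hour, hours_per_event, events_per_year):
    """Yearly cost of application performance "brownouts"."""
    return cost_per_hour * hours_per_event * events_per_year

# Figures from the customer example above.
print(annual_brownout_cost(6000, 6, 20))  # 720000

# Hypothetical what-if: the same 20 events, each resolved in 2 hours.
print(annual_brownout_cost(6000, 2, 20))  # 240000
```

Because hours per event is a direct multiplier, shortening the time to isolate and resolve a brownout reduces its annual cost proportionally.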
It took some investigation on the customer’s end to establish the value of different applications and who was using them, and then to run the numbers. But now they have some idea of the cost of all those shades of gray between up and down, which helps them justify investments in technology and process improvements that reduce the brownouts as well as the blackouts.
This is why vendors, such as ourselves, are willing to come out and have a conversation and demo with your company.
But even so, consider this idea as an inaccurate but useful shorthand in the form of a Zen koan: If the network is so slow that nothing gets done, is it any different than if the network were down altogether? And what is the difference between a network that is down for half a day and a network that takes twice as long to get anything done for a full day?
And if a computer goes down in the woods, but no one receives an error message, did it really have an error at all? And what is the sound of one router crashing?
By Keith Bendy
Business Development Manager, NetQoS
It’s hard to miss the “human network” theme in virtually all of Cisco’s recent commercials. They are clearly advocating a lot of converged network capabilities – voice, video, and other interpersonal communication or information methods.
It makes sense: video and voice are bandwidth-heavy applications, and providing more information about video and voice traffic is a logical growth area for Cisco. The challenge, however, is that despite all the video products Cisco has brought to market (from TelePresence to the acquisition of Flip), there aren’t many robust capabilities built into those products for troubleshooting performance.
Medianet is one of the largest initiatives in Cisco’s history, and it’s focused on bringing exactly those troubleshooting capabilities to market. The objective is to integrate media traffic reporting into Cisco products and IOS, so that customers can really understand how video and voice traffic is performing. Beyond troubleshooting, the overall goal for MediaNet is for the infrastructure itself to react to changes in performance (i.e., “autoprovisioning”).
MediaNet is just starting up, but Cisco is addressing a need that is very real, so I anticipate that its adoption will be high. Cisco may be ahead of the demand curve, but the need is pretty well established.
At a very high level, what's important to MediaNet customers is the ability to understand what performance looks like, find out where the issues are, and then drill in to get the information required to put the issue on the path to resolution. So when Cisco wanted to demonstrate MediaNet capabilities at Cisco Live, they used NetQoS Performance Center, because Cisco has a lot of experience working with NetQoS (on products like WAAS, ACE, and NAM) and Performance Center can take advantage of capabilities that exist today (like NBAR, IP SLA, and NetFlow).
With NetFlow, NetQoS Performance Center can show how much video is on the network and use ToS values to determine how the traffic is tagged. We can also see the endpoint IP addresses. But NBAR provides deeper protocol recognition than NetFlow typically gives you. NBAR reports specific identifiers for different kinds of traffic: instead of saying "This particular ToS queue is all my video traffic, and I don't know what kind of video it is," the NBAR identifiers would say, "This is telepresence traffic, this is security camera traffic, this is WebEx traffic, this is a video-capable phone," and tag all of it appropriately.
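As a rough illustration of the ToS-based view, the Python sketch below totals bytes per DSCP value from a handful of hypothetical flow records (the DSCP is the upper six bits of the 8-bit ToS byte). The class names follow the RFC 4594 recommendations; whether a given network actually marks its traffic that way is an assumption. The limitation described above shows up directly: two different video applications both marked AF41 land in the same bucket, which is the gap NBAR’s application identification fills.

```python
from collections import defaultdict

# Service classes recommended for these DSCP values in RFC 4594.
DSCP_CLASSES = {
    46: "EF (telephony)",
    34: "AF41 (multimedia conferencing)",
    32: "CS4 (real-time interactive)",
    0: "Default (best effort)",
}

def bytes_by_dscp(flows):
    """Sum bytes per DSCP value.

    Each flow is a dict with 'tos' (the 8-bit ToS byte as exported in the
    flow record) and 'bytes'. DSCP is the upper six bits of that byte."""
    totals = defaultdict(int)
    for flow in flows:
        totals[flow["tos"] >> 2] += flow["bytes"]
    return dict(totals)

# Hypothetical flow records, not real NetFlow output.
flows = [
    {"tos": 184, "bytes": 40_000},   # DSCP 46
    {"tos": 136, "bytes": 900_000},  # DSCP 34
    {"tos": 136, "bytes": 600_000},  # DSCP 34, possibly a different video app
    {"tos": 0, "bytes": 250_000},    # DSCP 0
]

for dscp, total in sorted(bytes_by_dscp(flows).items()):
    print(f"{DSCP_CLASSES.get(dscp, f'DSCP {dscp}')}: {total:,} bytes")
```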
Below is a video, from Cisco’s YouTube page, where Aamer Akhter, Technical Marketing Engineer at Cisco, demos the Cisco Medianet 1.0 network.
One of the things holding back the rollout of new applications (like VoIP, video, and unified communications) is the fear that they will cause network performance problems, according to Network World’s Denise Dubie, citing a survey from Apparent Networks.
Nearly 61% said that they had delayed a VoIP implementation due to network performance concerns. Some 35% postponed a video rollout for the same reasons and 26% put a unified communications project on hold. The survey also showed that network managers can’t always validate their service-level agreements (SLA) with external service providers. More than one-quarter of respondents don’t have the capability to validate SLAs.
It would be instructive to know if decision makers are “concerned” that new apps will reduce their performance because they have baselined performance and know that the network cannot handle new application rollouts… or if they’re concerned because they have no idea whether the network can handle it or not.
It’s the difference between being stopped by practicality and being paralyzed by fear.
And if you’re being paralyzed by fear, it’s costing you money.
For example, Cisco decided to “eat its own dog food” and estimated that it saved $277M by deploying its own virtual office telecommuting technology, a new network application (based on the Cisco Virtual Office) that leads to cost savings. If Cisco hadn’t known that its network could support the CVO application, it would have been out $277M.
Of course, the reason you don’t roll out an application that might save you millions, when you don’t know whether it will hurt network performance, is that poor network performance can cost more than whatever you’d save by rolling it out.
You can know, or you can be paralyzed by fear of the unknown. I know which I’d rather be.
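In the simplest case, knowing can start with a back-of-the-envelope check against your baseline. The Python sketch below asks whether a link has room for a given number of concurrent VoIP calls. The 87.2 kbps per call is a commonly cited planning figure for G.711 with 20 ms packetization over Ethernet (per direction); the 75% utilization ceiling and the link numbers are hypothetical. A real assessment would also look at latency, jitter, and loss, not just bandwidth.

```python
def voip_fits(link_kbps, baseline_peak_kbps, calls,
              per_call_kbps=87.2, max_utilization=0.75):
    """Return True if the baselined peak plus `calls` concurrent calls stays
    at or below `max_utilization` of the link (all rates in kbps, one direction)."""
    projected = baseline_peak_kbps + calls * per_call_kbps
    return projected <= link_kbps * max_utilization

# Hypothetical 10 Mbps link whose baselined peak is 5.5 Mbps.
print(voip_fits(10_000, 5_500, calls=20))  # True: about 7.2 Mbps projected
print(voip_fits(10_000, 5_500, calls=30))  # False: about 8.1 Mbps projected
```

Without the baselined peak, the `baseline_peak_kbps` argument is a guess, which is the difference between being stopped by practicality and being paralyzed by fear.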
