Websites tend to have a love-hate relationship with Slashdot.org. A link on Slashdot can send a world of traffic to your doorstep. It can also send so much traffic to your doorstep that your servers melt into piles of tapioca pudding.
So there is a certain irony in Slashdot slashdotting itself.
Uriah Welcome, SourceForge’s chief network engineer, wrote about the experience in a Slashdot post. His explanation is written in the dialect of GeekSpeak known as Engineeresque, so what follows is a paraphrase – the original text can be found on Slashdot.
At around 9pm, Welcome found that there were problems connecting to the site. He tried to log in remotely, but had difficulty due to the network problems. When he finally succeeded in logging in, he found that there was a massive amount of traffic, saturating 40Gbit/sec lines.
The incoming ports showed very little traffic, so he ruled out an external cause. He then looked at the internal switch ports. From the logs he was able to find out that many of the core switches were at 100% CPU utilization, and that the logged messages had something to do with multicast. Rebooting the cores did not resolve the CPU utilization problem.
Eventually the problem was solved by process of elimination: he turned off cabinet switches one by one until he had isolated the problem to a pair of switches. Shutting off the downlink ports of those switches relieved the problem at about 10:10pm.
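For readers who think better in code, the process of elimination described above can be sketched roughly as follows. This is only an illustration, not what Welcome actually ran: the `FakeNetwork` class, the `cab-NN` switch names, and the `set_enabled` and `overloaded` callbacks are all hypothetical stand-ins for whatever tooling a real network would have.

```python
from typing import Callable, List


def isolate_culprits(
    switches: List[str],
    set_enabled: Callable[[str, bool], None],
    overloaded: Callable[[], bool],
) -> List[str]:
    """Find the switches responsible for an overload by elimination."""
    turned_off: List[str] = []

    # Phase 1: turn switches off one by one until the overload clears.
    for name in switches:
        if not overloaded():
            break
        set_enabled(name, False)
        turned_off.append(name)

    if overloaded():
        # Everything is off and the problem persists: the cause lies elsewhere.
        raise RuntimeError("overload persists with all listed switches off")

    # Phase 2: bring the switches back one by one. Any switch whose return
    # revives the overload is a culprit and stays off.
    culprits: List[str] = []
    for name in turned_off:
        set_enabled(name, True)
        if overloaded():
            set_enabled(name, False)
            culprits.append(name)
    return culprits


class FakeNetwork:
    """Hypothetical stand-in for a real network, used only for this demo."""

    def __init__(self, names: List[str], bad: List[str]) -> None:
        self.enabled = {name: True for name in names}
        self.bad = set(bad)

    def set_enabled(self, name: str, on: bool) -> None:
        self.enabled[name] = on

    def overloaded(self) -> bool:
        # The cores are overloaded while any misbehaving switch is enabled.
        return any(self.enabled[name] for name in self.bad)


names = [f"cab-{i:02d}" for i in range(1, 9)]
net = FakeNetwork(names, bad=["cab-05", "cab-06"])
print(isolate_culprits(names, net.set_enabled, net.overloaded))
```

In the simulated run, the two misbehaving switches end up disabled while every other switch is back in service, which mirrors the outcome in the story: a pair of switches isolated and cut off.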
In a comment on Slashdot, Maz2331 gave some really good advice about why network engineers have to be in the data center:
It may be strange for those not in the networking field, but when things really go bad, the only place to be is physically in the data center.
That means looking at the LEDs on switches for traffic indications. If you see a single port is spewing a LOT of activity during an outage, disconnect it. No, don't make it "down" but pull the cable out of the port. Then go downstream and repeat until the potential problem set is reduced to an understandable level.
What really sucks about these kind of outages is that you can't remotely log in to various hosts or switches - you have to pull wires out of ports to break the "spew" that is taking things down.
Well, obviously, you can log in remotely to fix the problem, as Mr. Welcome did. But the act of logging in runs into the very problem you are logging in to fix. It’s not a “can-opener-in-a-can” situation, but it’s close.
The comments on the story are actually quite insightful, and are worth checking out – one user described how implementing a VoIP phone system broke the entire network in his factory. Of course, another user criticized “armchair networking”.
