At approximately 4:25 PM on Tuesday, November 18, 2008, the College experienced a CoC-wide internal network outage. The outage affects all users of the CoC network, including faculty, staff, and students. TSO is currently working to restore network services. Additional information will be posted as it becomes available.

Update: TSO has determined the cause of the outage to be internal to the CoC network. The source of the problem seems to be traffic originating from a CERCS switch in the HPCF. As of 8:30 PM, November 18, we are working to identify the offending hosts and the specific type of network traffic.

Update: At 11:00 PM, November 18, TSO restored normal network functionality to all services except those listed below. We are currently conducting forensics and addressing the affected services.

Affected servers/services:
Spanning tree issues affecting VLAN 29 on the CERCS network switch
The rohan cluster
Nodes 1-14 and 43-56 of the warp cluster
Stale NFS mounts on various file servers

Update: TSO has determined that there was a problem on the router and firewall that became apparent on Tuesday, November 18. The issue, which recurred Wednesday night, was causing the firewall to reach 80 to 99% CPU utilization, which affected connectivity across the network. The increase was indirectly caused by the power maintenance on Sunday, November 16. The HPCF clusters and the systems in Klaus and TSRB were left running during the maintenance, as it did not directly affect them. A significant portion of these systems had been left with the default yp (also known as NIS) configuration of "broadcast". This meant they would send an ARP/RPC broadcast request to their entire subnet, in this case the 117 subnet. In order to do this they had to reach the 130.207.117.1 gateway on their network, and the 117.1 interface is on the opposite side of the firewall from the 117.0 subnet. This had not been a problem before because the systems are typically not brought up simultaneously, so each broadcast request was brief and completed before the next machine came up to request information. This time, all of the systems set for broadcast requested the ARP/RPC information at the same time, overwhelming the firewall as it attempted to perform packet inspection on the multitude of requests.
At approximately 11:00 AM, November 20, the broadcast traffic was blocked on the 117 firewall to prevent any further impact to the firewall and the rest of the network. Firewall utilization now remains steady below 40%. TSO is correcting the configuration on these systems so that they do not broadcast but instead connect directly to arrakis-117 and usul-117, avoiding the firewall issue altogether (an illustrative configuration change is sketched below).
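For illustration only (not the exact change TSO applied), the fix amounts to replacing ypbind's broadcast mode with explicit server bindings. The sketch below assumes Linux-style clients that read /etc/yp.conf; the NIS domain name is a placeholder, while arrakis-117 and usul-117 are the servers named above. Solaris clients would instead be rebound with ypinit -c.

    # /etc/yp.conf (hypothetical sketch; the NIS domain name is a placeholder)
    #
    # Problematic default: ypbind broadcasts on the local subnet to locate a server.
    # domain EXAMPLE_NIS_DOMAIN broadcast
    #
    # Corrected configuration: bind directly to the named NIS servers, no broadcast.
    domain EXAMPLE_NIS_DOMAIN server arrakis-117
    domain EXAMPLE_NIS_DOMAIN server usul-117

After the file is updated, ypbind would typically be restarted on each host so the new bindings take effect.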

Update: Since approximately 7:00 AM, November 21, the College of Computing has been experiencing network latency that is impacting all administrative, research, and academic services. TSO is analyzing the network traffic to determine the offending hosts and/or services that may be causing the problem. Updates will be posted as they become available.

Update: As of 1:30 PM, November 21, TSO has determined that the latency is not the result of a host or service, nor is it occurring at the physical layer of the network. We are currently reviewing the network configuration files. Additional information will be posted as it becomes available.

Update: The detailed timeline for this outage is shown below:

On Tuesday, November 18, the College's main router was experiencing network latency. While TSO was troubleshooting the issue, the main CoC router crashed, and its configuration file was lost on reboot. After the reboot, the switch management network was spanning-tree flapping, so it was disabled on the CERCS core switch to restore switch management capability and VPN services for the remainder of the CoC.

On Wednesday, November 19, network slowness continued, with the firewall service module at 80% CPU utilization and the .117 firewall context consuming 60% of the total. Router CPU utilization reached 99%. It was determined that the Advanced Management Module (AMM) on CERCS warp cluster nodes 1-14 on the .117 VLAN was the cause, and the AMM was reseated. Router CPU utilization dropped to less than 20%; however, there was no change in the firewall service module CPU utilization.

On Thursday, November 20, OIT blocked sunrpc on the .117 VLAN. As a result, the .117 firewall context CPU utilization dropped to 30%, and the firewall service module CPU utilization dropped to 60%. We then shut down all but 10 systems on the .117 VLAN with no change in CPU utilization. The main CoC router CPU utilization continued to be less than 20%. The .117 firewall context CPU utilization remained at 30%, so it was moved to a second firewall service module to distribute the load.

On Friday, November 21, the main CoC router was unresponsive at 99% CPU utilization. TSO contacted OIT, whose logs showed spanning-tree flapping on the .117 VLAN. We verified the core router and switch configurations, as well as all interconnections on all uplinks to the entire CoC network. OIT and GTISC came onsite to assist TSO. The failure point was determined to be a built-in switch on the CERCS awing blade center. After the switch was removed, firewall CPU utilization returned to normal levels (<1%).

Update: During the network outage, we determined that some email was rejected by our mail server. The affected users have been notified of the rejected messages, including the date, timestamp, and sender of each message.

Owner of Alert
TSO