Our network monitoring has alerted us to a fault on CORE03 within our Goswell Road network. This fault is a recurrence of an issue identified yesterday that was resolved without impact. Additional logging was added at that time to assist diagnosis should the issue return.
The issue has been tracked down to the “Ethernet Out of Band Channel” (EOBC) control channel on the device's backplane.
Due to the number of line cards automatically taken out of service by the device, we are investigating whether this stems from a common hardware fault, such as the currently active supervisor module.
We diversely route our internal backhaul fibre uplinks across each core to ensure that a single line card failure does not result in an outage. This diverse routing is currently in operation; however, we have lost a number of links due to the fault, and the device is classed as at risk along with any directly connected equipment.
We are currently reviewing the logs and will update with further information and an action plan as soon as possible.
UPDATE01 – 09:49
After reviewing the logs, we have concluded that the next step is to swap the active and standby supervisors within that device. This will cause a brief outage to all services connected to that device. We will monitor the device closely after the change to see whether the same issue recurs. This reload has been scheduled for 10:00 today.
UPDATE02 – 10:08
The swap completed as expected; however, despite the newly active supervisor showing OK and passing diagnostics, it failed to fully take the system load and the change was reverted. We now suspect a possible backplane issue on this device. Further updates to follow.
UPDATE03 – 14:08
Further observations have been made and the log files reviewed in depth. At this stage we can advise that the backup supervisor within CORE03 has been reporting errors; however, Cisco IOS listed these as “Non-fatal” and as such they were not flagged within our monitoring platform.
We suspect a fault had occurred on the standby supervisor which was not picked up by the device's internal diagnostics until we brought the card fully into operation. We suspect this fault was affecting EOBC reporting and thus causing line cards to be disabled. As the previous fault took 24 hours to resurface, we are continuing to monitor. An emergency maintenance window will also be scheduled for CORE03 to replace the suspected failed card and apply an IOS update.
UPDATE04 – 14:30 – 22/08/2016
Despite seeing the device operate for over 24 hours without further errors, we have just observed the fault condition recur, again causing line cards to be disabled. We therefore now suspect this is a problem with the chassis itself and its backplane. We will now be replacing the entire device as a matter of course to prevent this escalating to an outage. As we don't have a pre-built device on site, further works will be scheduled and notified via the NOC.