Due to an unexpected reload of a broadband gateway on our network this afternoon, we are seeing a traffic imbalance on part of our broadband network.
We will be taking steps to disconnect sessions gracefully and rebalance the affected gateways.
End users will see a PPP reconnect taking approximately 5-20 seconds. In the rare event that your connection does not restore, you will need to power your router off and on again.
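For illustration only, the rebalancing described above can be thought of as working out how many sessions each overloaded gateway holds beyond an even share, since those are the sessions that would need to be disconnected so they can reconnect elsewhere. The sketch below is a minimal example under that assumption; the gateway names and session counts are hypothetical and it does not represent the tooling actually used on this network.

```python
def sessions_to_disconnect(session_counts):
    """Return, per overloaded gateway, how many sessions exceed an even share.

    session_counts maps a gateway name to its current session count.
    All names and numbers used with this function are hypothetical.
    """
    total = sum(session_counts.values())
    gateways = sorted(session_counts)
    base, extra = divmod(total, len(gateways))

    # Hand any remainder to the busiest gateways so fewer sessions have to move.
    by_load = sorted(gateways, key=lambda g: session_counts[g], reverse=True)
    targets = {g: base + (1 if i < extra else 0) for i, g in enumerate(by_load)}

    # Only gateways above their target need to shed sessions.
    return {
        g: session_counts[g] - targets[g]
        for g in gateways
        if session_counts[g] > targets[g]
    }


if __name__ == "__main__":
    # Hypothetical imbalance after one gateway reloaded and its sessions moved.
    print(sessions_to_disconnect({"lns-a": 900, "lns-b": 100, "lns-c": 500}))
```

In practice the disconnected sessions land wherever the upstream provider steers them, so a real rebalance would normally be done in small batches and re-measured between batches.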
UPDATE01 – 22:10
This work is now complete.
We will be making some changes to our broadband network tonight in order to isolate two upstream gateways that we suspect of causing additional latency to circuits routed via them.
This will cause existing connections via these gateways to drop and reconnect. Due to the nature of the change, this can take up to 20 minutes.
UPDATE 01 – 23:03
This work is about to start.
UPDATE 02 – 23:06
Tunnels have been terminated and traffic is starting to move across to other gateways.
UPDATE 03 – 23:28
We have seen an issue with the upstream gateways not accepting the L2TP control messages and not releasing circuits to other gateways. We have therefore had to revert part of the configuration. Further works will be required at a later date.
Our network monitoring has flagged high CPU usage on LNS01 that is starting to affect its operation. We are undertaking an emergency reboot to prevent it crashing completely. This will drop active sessions and force them on to other gateways.
UPDATE01 – 21:35
The reboot is complete. Services transferred to redundant gateways as expected. LNS01 is now back in operation. If you do not have service, please power down your router for 20 minutes.
We apologise for any inconvenience caused.
We are aware that one of our media gateways is not releasing channels once a call has cleared down. This is causing busy tones or “limit exceeded” messages.
We are currently working to resolve this ASAP.
UPDATE 01 – 19:00
Emergency Works have started. Any active calls on SIP02 have been dropped. We are sorry for any inconvenience caused.
UPDATE 02 – 19:03
SIP02 has reloaded and all services have been restored. We will now look at SIP01.
UPDATE 03 – 19:05
Emergency Works have started. Any active calls on SIP01 have been dropped. We are sorry for any inconvenience caused.
UPDATE 04 – 19:03
SIP01 has reloaded and all services have been restored.
At the request of our vendor, we will be upgrading to a firmware release that should address a bug which has been resulting in delayed RADIUS accounting data. Each firmware upgrade will take no longer than 10 seconds; however, a reload of each LNS is required. The reloads will be done one at a time and will result in a PPP drop for each DSL circuit. Sessions should automatically re-establish within 60 seconds.
This work is complete. Any users without a working internet connection are advised to power off their hardware for 20 minutes.
We have been advised by one of our fibre wave providers that they will be conducting some emergency maintenance on some fibre interconnects which includes a circuit used by ourselves.
To avoid any unexpected issues we will manually re-route traffic and flag that link as unavailable. Some PPP sessions may drop and re-establish for DSL circuits using this link.
UPDATE 01 – 22:23
Traffic has been re-routed. We were able to maintain reachability to our L2TP endpoints, so no PPP session drops were seen. The link will be brought back into service once we have received the all-clear from the wave provider.
Following on from recent repeated hardware failures on core03.structuredcommunications.co.uk (as detailed HERE), the decision has been taken to fully replace the device. We will also be taking the opportunity to upgrade the IOS image to bring this device in line with the current images across the rest of our network.
This work will involve powering down and physically moving the current device along with all installed line cards. Because of this, all directly connected services (listed below) will be unavailable for the duration of the works.
> Bonded DSL on AG1 & AG2
> Unmanaged SIP trunking provided via sipwise.easyipt.co.uk
> Managed VoIP services provided via primary-sw.r03.core03 & primary-sw.r04.core03
> Webhosting via server01.easyhttp.co.uk
> VPS sessions on esxi10.r02.structuredcommunications.co.uk
Other services will remain unaffected. Redundant services provided via other parts of the network (such as DNS & SMTP) will take over. Please ensure your configuration is up to date.
UPDATE01 – 20:10 – 24/08/2016 Engineers are on site and these works have started.
UPDATE02 – 22:06 – 24/08/2016 Engineers have completed the above works ahead of schedule and we can confirm all services have returned to normal. We apologise for the inconvenience caused.
We will continue to monitor the new device to ensure continued operation.
Our network monitoring has alerted us to a fault on CORE03 within our Goswell Road network. This fault is a recurrence of an issue identified yesterday that was resolved without impact. Additional logging was added at the time to assist further investigation should it be required.
The issue has been tracked down to the “Ethernet Out of Band Channel” (EOBC) control channel on the device's backplane.
Due to the number of line cards automatically taken out of service by the device, we are currently investigating whether this is caused by a common hardware fault, such as a fault in the currently active supervisor module.
We diversely route our internal backhaul fibre uplinks across each core to ensure that a single line card failure does not result in an outage. This is currently in operation; however, we have lost a number of links due to the fault, and the device is classed as at risk along with any directly connected equipment.
We are currently reviewing the logs and will update with further information and an action plan ASAP.
UPDATE01 – 09:49
After reviewing the logs we have concluded the next action step is to swap between active supervisors within that device. This will cause a brief outage to all services connected to that device. We will monitor the device closely after the change to see if the same issue occurs. This reload has been scheduled for 10:00 today.
UPDATE02 – 10:08
The swap completed as expected; however, despite this supervisor showing OK and passing diagnostics, it failed to fully take the system load and the change was reverted. We now suspect a possible backplane issue on this device. Further updates to follow.
UPDATE03 – 14:08
Further observations have been made and the log files reviewed in depth. At this stage we can advise that the backup supervisor within CORE03 has been reporting errors; however, the Cisco IOS listed these as “Non-fatal” and as such they have not been flagged within our monitoring platform.
We suspect a fault had occurred on the standby supervisor which had not been picked up by the device's internal diagnostics until we brought the card fully into operation. We suspect this fault was having an impact on the EOBC reporting and thus causing line cards to be disabled. As the previous fault took 24 hours to resurface, we are continuing to monitor. An emergency maintenance window is also going to be scheduled for CORE03 to replace the suspected failed card, along with an IOS update.
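As an illustration of how errors logged as “Non-fatal” could still be surfaced once they repeat, the sketch below counts non-fatal entries per component and reports any component that reaches a threshold. It is a minimal example only: the log line layout, the component names and the threshold are all assumptions made for illustration, and it does not reflect real Cisco IOS log output or the monitoring platform referred to above.

```python
from collections import Counter


def repeated_non_fatal(lines, threshold=3):
    """Return components whose non-fatal error count reaches the threshold.

    Assumes a hypothetical line layout of "<component> <severity> <message>".
    """
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        # Ignore malformed lines and anything that is not marked non-fatal.
        if len(parts) >= 2 and parts[1].lower() == "non-fatal":
            counts[parts[0]] += 1
    return sorted(c for c, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Hypothetical log excerpt; not real device output.
    sample = [
        "sup-standby non-fatal eobc error",
        "sup-standby non-fatal eobc error",
        "linecard-2 non-fatal crc error",
        "sup-standby info diagnostics passed",
        "sup-standby non-fatal eobc error",
    ]
    print(repeated_non_fatal(sample))
```

Counting repeats rather than alerting on every entry keeps a single harmless message from raising an alarm, while still catching a component that logs the same low-severity error again and again.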
UPDATE04 – 14:30 – 22/08/2016
Despite seeing the device operate for over 24 hours without further errors, we have just observed the fault conditions triggering line cards to be disabled. We therefore now suspect this is a problem with the chassis itself and the backplane. We will now be replacing the entire device as a matter of course to prevent this escalating to an outage. Further works will be scheduled and notified via the NOC, as we do not have a pre-built device on site.
We have identified a security bug within the core firmware installed and running on the following access switches within our network:
Due to the nature of this security bug, we have had little option but to update these devices immediately. This update required the above switches to be reloaded so the new IOS could be loaded. Services with redundant network links to other parts of our network would have seen no disruption.
Other switches are unaffected and this work is now complete.
We apologise for any inconvenience caused.
We have taken action to reload our primary soft switch to clear down some stuck SIP sessions. We are working with our software vendor to try to automate this process without the need for a complete reload of the signalling platform in future.
We have been advised that an update will help limit the need to do this, with a further one planned to resolve this completely.
Calls are routing correctly.