Broadband Outage 20/01/2023 – Structured Communications NOC

We are currently aware of a network issue effecting broadband customers, engineers are already on site in preparation for the works at Telehouse and are currently working on the issue. More updates to follow.

Update 20/01/203 22:02

Engineers are still working to find the root cause of the issue, we will post more updates as they become available.

Update 20/01/203 23:04

We can see connections have now come back online, engineers are still working on the issue and will provide an update shortly.

Update 20/01/203 23:22

If you are still without service please power down your router for at least 30 minutes, this should restore your service.

Update 21/01/2023 10:10

Anyone without service please reboot or power down your router for 20 minutes. We are sorry for the issues caused and a further update will be posted with details as to the cause shortly.

Summery 16/02/2023:

A full RFO has been sent to partners and wholesale customers.

On the night of the 20/01/2023 (during the 2nd planned Telehouse power works) Engineers were on site prepping for the power works and to be there should issues arise. We took the opportunity to proactively replace a PDU bar which was showing signs of having a failing management interface. This PDU was on the “FEED A” (The side Telehouse were working on) so no additional deemed risk was expected.

Power feed A was isolated and taken down shortly before the power works
where due to start and the PDU replaced but not re-powered due to the pending
works by Telehouse.

All platforms were operating as expected on a single power feeds.

Additionally, a planned line card replacement was due to take place which
involved moving DSL subscribers across the network in a controlled manor. The
affected LNS01 and LNS03 where isolated and subscribers were moved across. The isolated LNSs were bought back into service shortly after.

At this point we noticed that new inbound DSL connections where only being routed to LNS02 and LNS04. The migrated configuration was checked and confirmed to be as expected.

At this point LNS02 started to reboot uncontrollably which dropped all connected DSL subscribers in an uncontrolled manor. LNS02 was manually rebooted and returned to service but quickly started to reboot again. LNS02 was taken out of service and powered down.

Services from LNS02 did not reconnect so changes where rolled back on the line card migration however this did not make any difference.

Diagnostics on our side did not show the incoming RADIUS proxy requests from our layer 2 provider so we placed a call to there NOC who failed to confirm anything was wrong despite several calls. (This has now been confirmed and was the root cause for the extended outage)

LNS02 was powered backup and diagnostics showed the 12V power rail on the remaining power supply was low and causing the device to reload, however due to the quick reload times on these devices, it was not being flagged on SNMP and due to a combined voltage when both PSUs where energised it did not show as low prior to the event. Power was then swapped over to the other working power supply that was offline due to the power works. This resulted in a stable device.

LNS02 was then bought back into service however no DSL circuits where being routed to us.

Further investigations were taking place when a large volume of inbound DSL connections started to be seen authenticating.

Since the events took place, our wholesale DSL provider confirmed they experienced a Major outage on one of the access routers we are connected to however failed to advise us at the time until many hours after the events took place. A formal complaint has been raised and a RFO has since been provided to confirm a number of devices there side suffered issues and have since been replaced.

While there was a failure of one of our gateways, these are in redundant pairs and would not have caused a complete outage by itself. The events that took place further upstream with our wholesale provider where the root cause of the extended outage.

This was unfortunate timing and had we been advised of the issues, we would have been able to address the outage in another way. We do apologise for the issues seen.