LON01 – CORE03 – 24/08/2016 – 20:00 Till 23:59 *Emergency Work*

Following on from recent repeated hardware failures on core03.structuredcommunications.co.uk as detailed HERE the decision has been taken to fully replace the device. We will also be taking the opportunity to upgrade the IOS image to bring this device in line with the current images across the rest of our network.

This work will involve powering down and physically moving the current device along with all installed line cards. As a result, all directly connected services (listed below) will be unavailable for the duration of the works.

> Bonded DSL on AG1 & AG2

> un-managed SIP trunking provided via sipwise.easyipt.co.uk

> Managed VoIP services provided via primary-sw.r03.core03 & primary-sw.r04.core03

> Webhosting via server01.easyhttp.co.uk

> VPS sessions on esxi10.r02.structuredcommunications.co.uk

Other services will remain unaffected. Redundant services provided via other parts of the network (such as DNS & SMTP) will take over. Please ensure your configuration is up to date.

UPDATE01 – 20:10 – 24/08/2016 Engineers are on site and these works have started.

UPDATE02 – 22:06 – 24/08/2016 Engineers have completed the above works ahead of schedule and we can confirm all services have returned to normal. We apologize for the inconvenience caused.

We will continue to monitor the new device to ensure continued operation.

LON01 – CORE03 – 21/08/2016 – 08:33 *At Risk*

Our network monitoring has alerted us to a fault on CORE03 within our Goswell Road network. This fault is a re-occurrence of an issue identified yesterday that was resolved without impact. Additional logging was added at the time to further assist should it be required.

The issue has been tracked down to the “Ethernet Out of Band Channel” (EOBC) control channel on the device's backplane.

Due to the number of line cards automatically taken out of service by the device, we are currently investigating whether this is part of a common hardware fault, such as a fault on the currently active supervisor module.

We diversely route our internal backhaul fibre up-links across each core to ensure that a single line card failure does not result in an outage. This is currently in operation; however, we have lost a number of links due to the fault, and the device is classed as at risk along with any directly connected equipment.
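As a rough illustration of the diverse routing described above, the following Python sketch checks whether any single line card failure would remove every up-link. The up-link names and slot numbers are hypothetical and are not taken from our network inventory.

```python
# Hypothetical inventory: up-link name -> (core, line card slot).
# These names and slots are illustrative assumptions only.
UPLINKS = {
    "uplink-a": ("core03", 1),
    "uplink-b": ("core03", 2),
    "uplink-c": ("core03", 4),
}


def surviving_uplinks(uplinks, failed_cards):
    """Return the up-links that stay up when the given (core, slot) cards fail."""
    return sorted(name for name, card in uplinks.items() if card not in failed_cards)


def single_card_failure_safe(uplinks):
    """True if no single line card failure removes every up-link."""
    cards = set(uplinks.values())
    return all(surviving_uplinks(uplinks, {card}) for card in cards)


# One card failing always leaves at least one up-link in this layout.
print(single_card_failure_safe(UPLINKS))
# Losing two cards at once still leaves a link, but with no further margin.
print(surviving_uplinks(UPLINKS, {("core03", 1), ("core03", 2)}))
```

With several cards disabled at once, as in this fault, the remaining links carry all traffic, which is why the device is treated as at risk even though service continues.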

We are currently reviewing the logs and will update with further information / action plan asap.

UPDATE01 – 09:49
After reviewing the logs we have concluded the next action step is to swap between active supervisors within that device. This will cause a brief outage to all services connected to that device. We will monitor the device closely after the change to see if the same issue occurs. This reload has been scheduled for 10:00 today.

UPDATE02 – 10:08
The swap completed as expected, however despite this supervisor showing OK and passing diagnostics, it failed to fully take the system load and was reverted back. We suspect this is now a possible backplane issue on this device. Further updates to follow.

UPDATE03 – 14:08
Further observations have been made and the log files reviewed in depth. At this stage we can advise the backup supervisor within CORE03 has been reporting errors; however, the Cisco IOS listed these as “Non-fatal” and as such they were not flagged up within our monitoring platform.

We suspect a fault had occurred on the standby supervisor which had not been picked up by the device's internal diagnostics until we brought the card fully into operation. We suspect this fault was having an impact on the EOBC reporting and thus causing line cards to be disabled. As the previous fault took 24 hours to resurface, we are continuing to monitor. An emergency maintenance window is also going to be scheduled for CORE03 to replace the suspected failed card, along with an IOS update.

UPDATE04 – 14:30 – 22/08/2016
Despite seeing the device operate for over 24 hours without further errors, we have just observed the fault conditions triggering line cards to be disabled. We therefore now suspect this is a problem with the chassis itself and the backplane. We will now be replacing the entire device as a matter of course to prevent this escalating to an outage. Further works will be scheduled and notified via the NOC, as we do not have a pre-built device on site.

LON01 – Access Switches – 27/06/2016 – 21:12 *COMPLETE*

We have identified a security bug within the core firmware installed and running on the following access switches within our network:

primary-sw.r02
backup-sw.r02
primary-sw.r03
backup-sw.r03
primary-sw.r04
backup-sw.r04

Due to the nature of this security bug we have had little option but to immediately update these devices. This update required the above switches to be reloaded so the new IOS could be loaded. Services with redundant network links to other parts of our network would have seen no disruption.

Other switches are unaffected and this work is now complete.

We apologise for any inconvenience caused.

LON01 – EasyIPT – 16/05/2016 – 21:40 till 22:45 *Emergency Maintenance* *Complete*

We have taken action to reload our primary soft switch to clear down some stuck SIP sessions. We are working with our software vendor to try to automate this process without the need for a complete reload of the signalling platform in future.

We have been advised that an update will help limit the need to do this, with a further one planned to resolve this completely.

Calls are routing correctly.

LON01 – POP-C002.017 – 06/04/2016 – 10:00 – 12:00 – *Emergency Work* *COMPLETE*

Further to our NOC notice posted on the 31/03/2016 in respect to power at one of our POPs “C002.017 – Goswell Road”, our network monitoring has alerted us to another loss of power on the secondary feed at this cab.

Structured engineers are attending site tomorrow morning within the above maintenance window to install equipment that will allow us to isolate the faulty hardware without further risk to services provided via this cabinet going forward. The work will involve the replacement of various power distribution hardware. At this time the POP is operating on its redundant power feed and all services have been re-routed where possible. Ethernet services provided by this cab are considered “at risk” until power has been fully restored. Transit services will automatically re-route in the event of a failure.

Due to the nature of the works, engineers will be working within a live cabinet. No issues are expected and extreme care will be taken while the works are undertaken.

Further updates will be provided in the morning.

UPDATE01 – 10:10
Engineers have started work.

UPDATE02 – 10:55
The new power distribution hardware has been installed and engineers will begin to power up the affected hardware one device at a time. Level3 are on site with us in the event of another problem.

UPDATE03 – 11:10
All hardware has been powered up and the faulty device found (albeit with a bang and fire). Unfortunately the failed hardware is the redundant power supply on the network core within this rack. Redundant hardware of this size is not kept on site and we are currently in the process of sourcing another unit. Further updates to follow.

UPDATE04 – 12:02
Further tests have been carried out and confirmed the PSU has failed. Engineers have removed a power supply from another unit in our 4th floor suite and installed it within the 2nd floor POP to confirm it is undamaged by the recent events. Further updates to follow.

UPDATE05 – 12:45
Engineers have ordered a same day replacement from one of our suppliers. Engineers are going to remain on site to fit and commission the new hardware on arrival.

UPDATE06 – 15:25
Engineers remain on site awaiting hardware. ETA was 15:30, however this has been pushed back due to a crash on the A3.

UPDATE07 – 15:25
Engineers remain on site having little fun. We have been advised the part is now in London and will be with us by 17:30.

UPDATE08 – 17:35
The replacement part has arrived on site.

UPDATE09 – 17:50
Despite our best efforts, our supplier has shipped the wrong part! Discussions had with them have concluded with no further delivery options today. New hardware has been sourced and is being made available to site for a timed delivery. Engineers are attending again in the morning to swap out the (hopefully) correct new part. We do apologise for the delay in getting this resolved; however, we want to remind customers who route via this device that it is still operating as expected on its redundant supply.

UPDATE10 – 09:30 – 07/04/2016
Engineers have returned to site and are awaiting delivery of the new PSU.

UPDATE11 – 09:46 – 07/04/2016
Delivery update to advise the hardware will be on site before 11am.

UPDATE12 – 10:33 – 07/04/2016
Hardware has arrived on site and engineers have confirmed it is the correct unit this time.

UPDATE13 – 10:47 – 07/04/2016
Engineers have installed the new power supply and confirmed its operation within the core. A series of load tests have been conducted with normal operation observed.

UPDATE14 – 11:00 – 07/04/2016
We are happy the new power supply is operating as expected, however will continue to monitor its operation for the next few hours. The site is no longer classed as “at risk” and this issue will now be closed off.
We apologise once again for the delay this has taken to resolve and will be reviewing our internal procedures on hardware spares of this nature at Goswell Road.

LON01 – EasyXDSL – 24/03/2016 – 14:30 – *Emergency Maintenance*

We have just been made aware that our carrier in conjunction with BT will be conducting some emergency maintenance on several FTTC gateways within the South East area. This will cause connections to drop, however should reconnect almost instantly.

We apologize for the short notice given for these works and will advise once complete.

LON01 – EasyHTTP – 02/03/2016 – 17:30 – *Maintenance* *Complete*

Our network monitoring has alerted us to a memory problem with “Server01” within our EasyHTTP platform. The server is on-line and processing requests, however is becoming more unresponsive to our monitoring systems as time progresses.

To avoid a complete failure of the system, the decision has been made to power cycle the server at 17:30

SMTP and WEB services on this server will be unavailable for the duration. We will review the system for stability once the reboot is complete and take further action where required.

We apologise for any inconvenience this may cause.

LON01 – EasyIPT – 29/02/2016 – 21:30 till 22:00 *Emergency Maintenance* *Complete*

Following on from reports today of SIP channel limits being reached for both inbound and outbound calls affecting some of our managed PBX systems, work has been undertaken on our core softswitch to try and identify the root cause of this issue.

Changes have been made to the platform; however, an emergency reboot is required. Due to the size of the platform this reboot will take up to 15 minutes to complete. During this time inbound and outbound calls will be limited.

We will advise once service is restored and calls are routing again.

Apologies for any inconvenience this may cause.

UPDATE01 – 21:40 softswitch reboot is complete, we are monitoring traffic flow

UPDATE02 – 22:55 A random check of managed PBX systems shows them as registered; this is not reflected on our softswitch, although traffic is flowing.

UPDATE03 – 22:10 Systems continue to show inaccurate data about the status of some registrations. We suspect stuck SIP sessions on some managed PBXs as this is affecting a wide range of PBX software versions.

UPDATE04 – 21:40 All affected managed PBXs are showing as on-line correctly.

LON01 – 11/11/2015 – 23:00 till 03:00 – EasyXDSL – Emergency Maintenance *COMPLETE*

We have been advised this afternoon by one of our interconnect providers that they will be conducting a software upgrade on their core Junos router at Telehouse North between 23:00 and 03:00 tonight. This upgrade work will affect one of the fibre waves we use for DSL termination back in to our network at Goswell Road. DSL traffic traversing this link will drop during the maintenance window; however, it should re-establish over an alternative link via Telehouse East, which is unaffected.

The nature of these works includes a reload of the Junos hardware at Telehouse North by our wave supplier, and in theory this should be the only disruption we see. Unfortunately, short disruptions like this will cause multiple DSL PPP session re-terminations as sessions flip back and forth between Telehouse North & Telehouse East, because our network is designed to use the former where possible. Due to this concern, and to avoid any other unforeseen issues that would cause session drops during the maintenance window, we will be manually shutting down the Telehouse North link at 22:50 tonight with the view of bringing it back up at 07:00 tomorrow morning. This will limit the session drops to these times and provide a stable service throughout.
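To illustrate the reasoning, the following Python sketch counts how often sessions would move between links under a simple model. It is only a sketch under stated assumptions: the function name and the step-based model are hypothetical, and it does not represent how our routers actually make decisions.

```python
def count_session_moves(primary_up, prefer_primary=True, start_on_primary=True):
    """Count how many times sessions re-terminate across a series of time steps.

    primary_up: list of booleans, the state of the preferred link at each step.
    prefer_primary: if True, sessions move back to the preferred link when it returns.
    """
    on_primary = start_on_primary
    moves = 0
    for up in primary_up:
        if on_primary and not up:
            on_primary = False  # preferred link lost: sessions drop and move to backup
            moves += 1
        elif not on_primary and up and prefer_primary:
            on_primary = True   # preferred link back: sessions drop again to return
            moves += 1
    return moves


# Preferred link flaps three times during the window, then stays up.
flapping = [True, False, True, False, True, False, True, True]
print(count_session_moves(flapping))   # 6

# Manual approach: link held down for the whole window, then restored once.
held_down = [False] * 7 + [True]
print(count_session_moves(held_down))  # 2
```

In this model, holding the link down limits the disruption to two moves, one at shutdown and one at restoration, regardless of how many times the remote hardware reloads in between.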

We apologise for any inconvenience this may cause.

UPDATE 01 – 18:45

We have reviewed the original manual switch over time and concluded it does not allow any contingency should this be required. We have therefore updated this to 22:00.

UPDATE 02 – 07:10 – 12/11/2015

This work is complete.

LON01 – EasyIPT – 13/07/2014 – 20:00 till 21:00 *Planned Maintenance* *COMPLETE*

We are aware some users have been seeing stuck SIP sessions today, whereby “BYE” messages have not been completely clearing sessions down, resulting in accounts hitting a concurrent call limit. A workaround has been put in place until this evening, when we can reload the core device to drop any sessions still in this state.
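The effect of a missed “BYE” on a concurrent call limit can be sketched in Python as follows. This is a simplified model with hypothetical names (`CallLimiter`, `sweep`), not the logic used by our soft switch; the age-based sweep is just one possible way to clear stuck sessions, and is not the workaround referred to above.

```python
import time


class CallLimiter:
    """Tracks active calls per account and enforces a concurrent call limit."""

    def __init__(self, limit, max_age_seconds):
        self.limit = limit
        self.max_age = max_age_seconds
        self.active = {}  # account -> {call_id: start_time}

    def invite(self, account, call_id, now=None):
        """Admit a new call if the account is under its limit."""
        now = time.time() if now is None else now
        calls = self.active.setdefault(account, {})
        if len(calls) >= self.limit:
            return False
        calls[call_id] = now
        return True

    def bye(self, account, call_id):
        """Clear a call down; if the BYE is lost, this is never called."""
        self.active.get(account, {}).pop(call_id, None)

    def sweep(self, now=None):
        """Drop calls older than max_age (assumed stuck); return how many were removed."""
        now = time.time() if now is None else now
        removed = 0
        for calls in self.active.values():
            stale = [cid for cid, started in calls.items() if now - started > self.max_age]
            for cid in stale:
                del calls[cid]
                removed += 1
        return removed


limiter = CallLimiter(limit=2, max_age_seconds=3600)
limiter.invite("acct1", "call-a", now=0)
limiter.invite("acct1", "call-b", now=10)
# Neither BYE arrives, so the account is now at its limit.
print(limiter.invite("acct1", "call-c", now=20))    # False
print(limiter.sweep(now=4000))                      # 2
print(limiter.invite("acct1", "call-c", now=4001))  # True
```

Reloading the device has the same end result as the sweep here, in that all tracked sessions are discarded, but it also interrupts calls that are genuinely in progress.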

UPDATE01 – 20:05
This work has started and the server has been requested to reload.

UPDATE02 – 20:15
The server has reloaded and is accepting SIP requests, however is not processing any call routing.

UPDATE03 – 20:35
We have discovered that the Kamailio Proxy service has failed to start which is why calls are not being processed. We are working to restore the service.

UPDATE04 – 20:40
The service has been restored and test calls are now routing OK.