News
Apr
8
Incident Report for Outage on April 3 2019.
Posted by Kian Ng on 08 April 2019 06:18 PM
Dear customers, 

This report was compiled after detailed interviews and investigation with our technical team and our software vendor (Virtuozzo.com). Any communication provided prior to this incident report should be disregarded, as our team may have provided information that was believed to be accurate at the time of the outage but was shown not to be so after investigation.



What happened?

(1) A bug in the virtualisation infrastructure running ReadySpace Cloud Servers was discovered.

(2) There are currently 2 versions, v6 and v7, and affected VMs would have been running on either or both.

(3) The issue occurred when v6 VMs were suddenly and unexpectedly unable to connect to storage as observed by engineers. Error : MON ERR MDS#4 died unexpectedly (122): snap: no objtype 12 found.

(4) A support ticket was submitted to the software vendor and the cause of the issue was identified as a bug. (Ticket Reference - [Virtuozzo #26185] Ticket Contents - "Object type 12 is MDS_OBJ_KVSTORE. It was added in VZ7.0.8 storage release. I have submitted a bug with internal ID PSBM-93335 to our Developers Team. Regarding this case, for now you is not able to create vz6 MDS in the mixed mode cluster. Workaround - upgrade this vz6 storage node to VZ7.")

(5) No immediate fix from Virtuozzo (VZ) was available.
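
The mixed-mode condition described above can be illustrated with a minimal sketch. It assumes a hand-maintained inventory of storage nodes and their major versions; the `StorageNode` type, the node names and both helper functions are hypothetical and are not part of any Virtuozzo tool or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageNode:
    name: str           # hypothetical node label, for illustration only
    major_version: int  # 6 for VZ6, 7 for VZ7


def is_mixed_mode(nodes):
    """Return True when nodes of more than one major version share a cluster."""
    return len({node.major_version for node in nodes}) > 1


def nodes_to_upgrade(nodes, target=7):
    """List the nodes still below the target major version."""
    return sorted(node.name for node in nodes if node.major_version < target)


# Hypothetical inventory: two VZ6 storage nodes alongside one VZ7 node.
cluster = [
    StorageNode("node-a", 6),
    StorageNode("node-b", 7),
    StorageNode("node-c", 6),
]

if is_mixed_mode(cluster):
    print("Mixed-mode cluster; upgrade candidates:", nodes_to_upgrade(cluster))
```

A check of this kind only flags the configuration that the vendor's workaround removes; it does not detect the bug itself.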



What were the issues?

(1) VZ6 and VZ7 were suddenly unable to run in mixed mode after the recent VZ7 storage release.

(2) No errors were encountered for 110 days after the initial update to VZ7 until an unexpected error occurred in the early hours of 3rd April. (MON ERR MDS#4 died unexpectedly (122): snap: no objtype 12 found (3)). Immediate action to initiate recovery was required to prevent data loss as many clients had not procured a backup service.



How was the issue resolved?

(1) As advised by Virtuozzo, the workaround for this issue was to upgrade the VZ6 storage node to VZ7.

(2) As 2 nodes in VZ6 that contained 12TB of data were degraded, we had to add additional storage (at least 4.5TB) to the VZ7 cluster to accommodate the data transfer. Again, this was necessary to prevent any form of data loss, as many affected clients had not purchased a backup.

(3) Servers were immediately deployed and connected to the VZ7 cluster to allow the sync to start. This process would have taken 5 days to complete, which meant data would have been unavailable to server nodes for 5 days.

(4) Another server node of 6TB was deployed and connected to the VZ7 cluster to speed up the recovery process. Although the amount of data to be replicated was massive, the process was successfully reduced from the initial estimate of 5 days to approximately 20 hours.

(5) Within these 20 hours, data first needed to be duplicated in degraded mode, which took about 10 hours. A workaround was then applied by the Virtuozzo Support Team to allow data from the disconnected VZ6 to be moved into the VZ7 storage cluster. This took another 10 hours. During this process, some VMs were in degraded mode and may not have been accessible. This was because the data chunks of an individual VM may have been split between VZ6 and VZ7.

(6) Once completed, all VMs were back online, albeit with low IOPS. This was because another VZ6 storage had to be replicated. This meant that, while the servers were online, data replication continued, resulting in the low IOPS.

(7) Once this replication is completed, all VMs will be back online with high IOPS.

(8) The last step of this recovery will be to upgrade all VZ6 compute and memory to VZ7 compute and memory.
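
As a rough check on the timings above, the sketch below works out the average throughput implied by the two estimates. It assumes that the full 12TB had to be moved and that TB means decimal terabytes; neither assumption is stated in the report, so the figures are indicative only.

```python
TB = 10**12  # bytes per decimal terabyte (assumption: the report does not say TB or TiB)


def implied_throughput_mb_s(data_tb, hours):
    """Average MB/s needed to move data_tb terabytes in the given number of hours."""
    seconds = hours * 3600
    return data_tb * TB / seconds / 10**6


initial_hours = 5 * 24  # initial estimate of 5 days
actual_hours = 20       # approximate time after the extra node was added

slow = implied_throughput_mb_s(12, initial_hours)
fast = implied_throughput_mb_s(12, actual_hours)
speedup = initial_hours / actual_hours

print(f"5-day estimate: about {slow:.1f} MB/s")
print(f"20-hour result: about {fast:.1f} MB/s")
print(f"Speed-up factor: {speedup:.0f}x")
```

Under these assumptions the recovery ran at roughly six times the rate implied by the initial estimate.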



What is the status now?

(1) Normal service has resumed and customers are able to access the Cloud Servers (VMs).

(2) Customers who are using vz6 compute and memory are scheduled to be migrated to vz7 compute and memory.

(3) The software provider (Virtuozzo) is producing a fix for this issue.



Conclusion and next steps:

(1) Maintenance will be carried out immediately to add more SSD server nodes for more IOPS.

(2) This issue was caused by a software bug that surfaced unexpectedly between VZ6 and VZ7 storage nodes. Reference for the software-defined storage we use - https://www.virtuozzo.com/products/virtuozzo-storage.html

(3) We will no longer provide mixed-mode clusters from VZ7 onwards. Customers who need the latest version will need to migrate their VMs to another cluster.

(4) Although software bugs can occur regardless of the HA ratio, we will be extending our cluster to a higher HA ratio.

(5) We will move our websites from the main cluster to an isolated cluster so as to maintain constant communication with our users during service outages.

(6) We will re-evaluate the VZ Storage cluster and start exploring alternative software-defined storage solutions, e.g. Ceph.

(7) We sincerely and unreservedly apologise for the severity of downtime and the extreme inconvenience caused to our customers and their users.

(8) The management team at ReadySpace would like to thank our technical team for working around the clock and the customer service team for managing relentless client queries during the time of outage.



Should you have further queries, please contact us via help@readyspace.com with the subject "Outage on 3rd April".


ReadySpace Team



Apr
8
[Scheduled Maintenance] Updating of firmware of BGP routers - Singapore DC
Posted by Yong Hwang Poh on 08 April 2019 11:04 AM
Dear Customers

In order to serve you better, we will be performing a firmware update on our BGP routers. Details are as follows:


Task - Updating of BGP routers
Date - 15th April 2019
Time - 2359 hours to 0030 hours (SG Time)

BGP routers will be rebooted after the update. A short downtime of 1 to 2 minutes is expected during this maintenance.
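
Because the stated window crosses midnight, the sketch below shows one way to work out its end date and length. It assumes the 0030 end time falls on the day after the listed date, which the notice does not state explicitly; the `window` helper is hypothetical.

```python
from datetime import datetime, timedelta


def window(date_str, start_hhmm, end_hhmm):
    """Return (start, end) datetimes, rolling the end into the next day if needed."""
    start = datetime.strptime(f"{date_str} {start_hhmm}", "%Y-%m-%d %H%M")
    end = datetime.strptime(f"{date_str} {end_hhmm}", "%Y-%m-%d %H%M")
    if end <= start:
        # Assumption: an end time earlier than the start means the following day.
        end += timedelta(days=1)
    return start, end


start, end = window("2019-04-15", "2359", "0030")
duration_minutes = int((end - start).total_seconds() // 60)

print("Starts:", start.isoformat(sep=" "))
print("Ends:  ", end.isoformat(sep=" "))
print("Length:", duration_minutes, "minutes")
```

Both times are treated as Singapore local time, as given in the notice.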

We apologise for any inconvenience this may cause and look forward to providing you with better connectivity and security moving forward.


Best Regards,


ReadySpace Team 




Apr
5
Update on Replication and Maintenance
Posted by Kian Ng on 05 April 2019 10:40 PM
Dear customers,

We are pleased to announce that replication work resulting from the recent storage issue has been completed. Normal service for all users should have resumed by now. Should you encounter any residual effects such as slowness, do let us know via helpdesk.readyspace.com and we will investigate further for you.

Do also kindly note that, as announced earlier, we will be undertaking further maintenance and upgrading works over this weekend. Some users may experience intermittent connectivity during the course of the works.

We will be issuing an Incident Report soon regarding the recent outage. We would like to express our sincerest apologies and our heartfelt gratitude for your patience and understanding.

Thank you.


ReadySpace Team



Apr
5
Maintenance and Upgrading
Posted by Kian Ng on 05 April 2019 10:02 AM
Dear customers,  

We will be performing maintenance and upgrading works from Friday 5th April 2019, 10pm (2200hrs) SGT, to Sunday 7th April 2019, 6am (0600hrs) SGT. 

Some users on the Cloud Server and Cloud Infrastructure platforms may experience minor disruptions to their service during this maintenance exercise.  

We apologise for any inconvenience that may inadvertently occur.  


ReadySpace Team



Apr
5
Urgent Storage Replication
Posted by Sheeran Kee on 05 April 2019 09:32 AM
Dear Customers,

This is an update to our previous news on the storage issue. Data replication has been running for the past 24 hours or so and has reached its last 1%.

Currently, customers with database workloads will experience performance issues, and application loading times will increase as a result. However, services that do not need high IOPS, such as loading of images, files and web services, will run as usual, although a slight lag might be felt.

We will provide another update in the next 6 hours.

Thank you.



Apr
3
UPDATE: Issue with storage at our Cloud platforms
Posted by Adrian Jiang on 03 April 2019 10:20 PM
Dear customers, 

The software-defined storage has been restored and we are now restoring VMs. You might see your VMs running anytime from now. Thank you very much for your patience and continued support.

We will update you once all systems have returned to their normal state.


ReadySpace Team




ReadySpace Helpdesk - Giving you space for growth